ETAPS Foreword


Welcome to the 25th ETAPS! ETAPS 2022 took place in Munich, the beautiful capital 
of Bavaria, in Germany. 

ETAPS 2022 is the 25th instance of the European Joint Conferences on Theory and 
Practice of Software. ETAPS is an annual federated conference established in 1998, 
and consists of four conferences: ESOP, FASE, FoSSaCS, and TACAS. Each 
conference has its own Program Committee (PC) and its own Steering Committee 
(SC). The conferences cover various aspects of software systems, ranging from
theoretical computer science to foundations of programming languages, analysis tools, and
formal approaches to software engineering. Organizing these conferences in a coherent, 
highly synchronized conference program enables researchers to participate in an 
exciting event, having the possibility to meet many colleagues working in different 
directions in the field, and to easily attend talks of different conferences. On the 
weekend before the main conference, numerous satellite workshops took place that
attracted many researchers from all over the globe.

ETAPS 2022 received 362 submissions in total, 111 of which were accepted, 
yielding an overall acceptance rate of 30.7%. I thank all the authors for their interest in 
ETAPS, all the reviewers for their reviewing efforts, the PC members for their
contributions, and in particular the PC (co-)chairs for their hard work in running this entire
intensive process. Last but not least, my congratulations to all authors of the accepted 
papers! 

ETAPS 2022 featured the unifying invited speakers Alexandra Silva (University 
College London, UK, and Cornell University, USA) and Tomas Vojnar (Brno 
University of Technology, Czech Republic) and the conference-specific invited 
speakers Nathalie Bertrand (Inria Rennes, France) for FoSSaCS and Lenore Zuck 
(University of Illinois at Chicago, USA) for TACAS. Invited tutorials were provided by 
Stacey Jeffery (CWI and QuSoft, The Netherlands) on quantum computing and 
Nicholas Lane (University of Cambridge and Samsung AI Lab, UK) on federated 
learning. 

As this event was the 25th edition of ETAPS, part of the program was a special 
celebration where we looked back on the achievements of ETAPS and its constituting 
conferences in the past, but we also looked into the future, and discussed the challenges 
ahead for research in software science. This edition also reinstated the ETAPS
mentoring workshop for PhD students.

ETAPS 2022 took place in Munich, Germany, and was organized jointly by the 
Technical University of Munich (TUM) and the LMU Munich. The former was 
founded in 1868, and the latter in 1472 as the 6th oldest German university still running 
today. Together, they have 100,000 enrolled students, regularly rank among the top 
100 universities worldwide (with TUM's computer-science department ranked #1 in
the European Union), and their researchers and alumni include 60 Nobel laureates.
The local organization team consisted of Jan Křetínský (general chair), Dirk Beyer
(general, financial, and workshop chair), Julia Eisentraut (organization chair), and 
Alexandros Evangelidis (local proceedings chair). 

ETAPS 2022 was further supported by the following associations and societies: 
ETAPS e.V., EATCS (European Association for Theoretical Computer Science), 
EAPLS (European Association for Programming Languages and Systems), and EASST 
(European Association of Software Science and Technology). 

The ETAPS Steering Committee consists of an Executive Board, and
representatives of the individual ETAPS conferences, as well as representatives of EATCS,
EAPLS, and EASST. The Executive Board consists of Holger Hermanns 
(Saarbrücken), Marieke Huisman (Twente, chair), Jan Kofron (Prague), Barbara König 
(Duisburg), Thomas Noll (Aachen), Caterina Urban (Paris), Tarmo Uustalu (Reykjavik 
and Tallinn), and Lenore Zuck (Chicago). 

Other members of the Steering Committee are Patricia Bouyer (Paris), Einar Broch 
Johnsen (Oslo), Dana Fisman (Be'er Sheva), Reiko Heckel (Leicester), Joost-Pieter 
Katoen (Aachen and Twente), Fabrice Kordon (Paris), Jan Křetínský (Munich), Orna 
Kupferman (Jerusalem), Leen Lambers (Cottbus), Tiziana Margaria (Limerick), 
Andrew M. Pitts (Cambridge), Elizabeth Polgreen (Edinburgh), Grigore Rosu (Illinois), 
Peter Ryan (Luxembourg), Sriram Sankaranarayanan (Boulder), Don Sannella
(Edinburgh), Lutz Schröder (Erlangen), Ilya Sergey (Singapore), Natasha Sharygina
(Lugano), Pawel Sobocinski (Tallinn), Peter Thiemann (Freiburg), Sebastian Uchitel 
(London and Buenos Aires), Jan Vitek (Prague), Andrzej Wasowski (Copenhagen), 
Thomas Wies (New York), Anton Wijs (Eindhoven), and Manuel Wimmer (Linz). 

I'd like to take this opportunity to thank all authors, attendees, organizers of the 
satellite workshops, and Springer-Verlag GmbH for their support. I hope you all 
enjoyed ETAPS 2022. 

Finally, a big thanks to Jan, Julia, Dirk, and their local organization team for all their 
enormous efforts to make ETAPS a fantastic event. 


February 2022 Marieke Huisman 
ETAPS SC Chair 
ETAPS e.V. President 


Preface 


TACAS 2022 was the 28th edition of the International Conference on Tools and 
Algorithms for the Construction and Analysis of Systems. TACAS 2022 was part of the 
25th European Joint Conferences on Theory and Practice of Software (ETAPS 2022), 
which was held from April 2 to April 7 in Munich, Germany, as well as online due to the 
COVID-19 pandemic. TACAS is a forum for researchers, developers, and users
interested in rigorous tools and algorithms for the construction and analysis of systems. The
conference aims to bridge the gaps between different communities with this common 
interest and to support them in their quest to improve the utility, reliability, flexibility, 
and efficiency of tools and algorithms for building computer-controlled systems. 
There were four submission categories for TACAS 2022:
1. Research papers advancing the theoretical foundations for the construction and
analysis of systems. 
2. Case study papers with an emphasis on a real-world setting. 
3. Regular tool papers presenting a new tool, a new tool component, or novel 
extensions to an existing tool. 
4. Tool demonstration papers focusing on the usage aspects of tools. 


Papers of categories 1-3 were restricted to 16 pages, and papers of category 4 to six
pages. 

This year 159 papers were submitted to TACAS, consisting of 112 research papers, 
five case study papers, 33 regular tool papers, and nine tool demo papers. Authors were 
allowed to submit up to four papers. Each paper was reviewed by three Program 
Committee (PC) members, who made use of subreviewers. Similarly to previous years, 
it was possible to submit an artifact alongside a paper, which was mandatory for regular 
tool and tool demo papers. 

An artifact might consist of a tool, models, proofs, or other data required for
validation of the results of the paper. The Artifact Evaluation Committee (AEC) was tasked
with reviewing the artifacts based on their documentation, ease of use, and, most
importantly, whether the results presented in the corresponding paper could be
accurately reproduced. Most of the evaluation was carried out using a standardized virtual
machine to ensure consistency of the results, except for those artifacts that had special 
hardware or software requirements. The evaluation consisted of two rounds. The first 
round was carried out in parallel with the work of the PC. The judgment of the AEC 
was communicated to the PC and weighed in their discussion. The second round took 
place after paper acceptance notifications were sent out; authors of accepted research 
papers who did not submit an artifact in the first round could submit their artifact at this 
time. In total, 86 artifacts were submitted (79 in the first round and seven in the second) 
and evaluated by the AEC regarding their availability, functionality, and/or reusability. 
Papers with an artifact that was successfully evaluated include one or more badges on 
the first page, certifying the respective properties.

Selected authors were requested to provide a rebuttal for both papers and artifacts in
case a review gave rise to questions. Using the review reports and rebuttals, the 
Program and the Artifact Evaluation Committees extensively discussed the papers and 
artifacts and ultimately decided to accept 33 research papers, one case study, 12 tool 
papers, and four tool demos. 

This corresponds to an acceptance rate of 29.46% for research papers and an overall 
acceptance rate of 31.44%. 

Besides the regular conference papers, this two-volume proceedings also contains 
16 short papers that describe the participating verification systems and a competition 
report presenting the results of the 11th SV-COMP, the competition on automatic 
software verifiers for C and Java programs. These papers were reviewed by a separate 
Program Committee (PC); each of the papers was assessed by at least three reviewers. 
A total of 47 verification systems with developers from 11 countries entered the
systematic comparative evaluation, including four submissions from industry. Two
sessions in the TACAS program were reserved for the presentation of the results: (1) a
summary by the competition chair and of the participating tools by the developer teams 
in the first session, and (2) an open community meeting in the second session. 

We would like to thank all the people who helped to make TACAS 2022 successful. 
First, we would like to thank the authors for submitting their papers to TACAS 2022. 
The PC members and additional reviewers did a great job in reviewing papers: they 
contributed informed and detailed reports and engaged in the PC discussions. We also 
thank the steering committee, and especially its chair, Joost-Pieter Katoen, for his 
valuable advice. Lastly, we would like to thank the overall organization team of 
ETAPS 2022.

Abstract. A continuous-time Markov chain (CTMC) execution is a
continuous class of probability distributions over states. This paper proposes
a probabilistic linear-time temporal logic, namely continuous-time linear
logic (CLL), to reason about the probability distribution execution of
CTMCs. We define the syntax of CLL on the space of probability
distributions. The syntax of CLL includes multiphase timed until formulas,
and the semantics of CLL allows time reset to study relatively temporal
properties. We derive a corresponding model-checking algorithm for CLL
formulas. The correctness of the model-checking algorithm depends on
Schanuel’s conjecture, a central open problem in transcendental
number theory. Furthermore, we provide a running example of CTMCs to
illustrate our method.


1 Introduction 


As a popular model of probabilistic continuous-time systems, continuous-time 
Markov chains (CTMCs) have been extensively studied since Kolmogorov [25]. 
Over the past 20 years, probabilistic continuous-time model checking has received
much attention. Adapting probabilistic computational tree logic (PCTL) [22] to
this context with extra multiphase timed until formulas
$\Phi_1 U^{\mathcal{T}_1} \Phi_2 \cdots U^{\mathcal{T}_K} \Phi_{K+1}$,
for state formula $\Phi$ and time interval $\mathcal{T}$, Aziz et al. proposed continuous
stochastic logic (CSL) to specify the branching-time properties of CTMCs and the
model-checking problem for CSL is decidable [8]. After that, efficient
model-checking algorithms were developed by transient analysis of CTMCs using
uniformization [9] and stratification [41] for a restricted version (path formulas are
restricted to single until formulas $\Phi_1 U^{\mathcal{T}} \Phi_2$) and a full version of CSL,
respectively. These algorithms have been practically implemented in model checkers
PRISM [26], MRMC [24] and STORM [18]. Further details can be found in an 
excellent survey [23]. 

There are also different ways to specify the linear-time properties of CTMCs. 
Timed automata were first used to achieve this task [11,13,14,15,19], and then
metric temporal logic (MTL) [12] was also considered in this context.
Subsequently, the probability of “the system being in state $s_0$ within five time units
after having continuously remained in state $s_1$” can be computed. However, some
statements cannot be specified and verified because of the lack of a probabilistic
linear-time temporal logic, for instance “the system being in state $s_0$ with high
probability ($\ge 0.9$) within five time units after having continuously remained in
state $s_1$ with low probability ($\le 0.1$)”. Furthermore, this probabilistic property
cannot be expressed by CSL because CSL cannot express properties that are
defined across several state transitions of the same time length in the execution
of a CTMC.

© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 3-21, 2022.

In this paper, to express the probabilistic linear-time properties mentioned
above, we introduce continuous-time linear logic (CLL). In particular, we
adopt the viewpoint used in [2] by regarding CTMCs as transformers of
probability distributions over states. CLL studies the properties of the probability
distribution execution generated by a given initial probability distribution over
time. By the fundamental difference between the views of state executions and
probability distribution executions of CTMCs, CLL and CSL are incomparable
and complementary, much like the relation between probabilistic linear-time temporal
logic (PLTL) and PCTL in model checking discrete-time Markov chains [2,
Section 3.3].

The atomic propositions of CLL are interpreted on the space of probability
distributions over states of CTMCs. We apply the method of symbolic dynamics
to the probability distributions of CTMCs. To be specific, we symbolize the
probability value space $[0, 1]$ into a finite set of intervals
$\mathscr{I} = \{\mathcal{I}_k \subseteq [0, 1]\}_{k=1}^{m}$.
A probability distribution $\mu$ over its set of states $S = \{s_0, s_1, \ldots, s_{d-1}\}$ is then
represented symbolically as a set of symbols


$$\mathbb{S}(\mu) = \{(s, \mathcal{I}) \in S \times \mathscr{I} : \mu(s) \in \mathcal{I}\}$$


where each symbol $(s, \mathcal{I})$ asserts $\mu(s) \in \mathcal{I}$, i.e., the probability of state $s$ in
distribution $\mu$ falls in interval $\mathcal{I}$. For example, $(s_0, [0.9, 1])$ means the system is
in state $s_0$ with a probability between 0.9 and 1. The symbolization idea of distributions
has been considered in [2]: choosing a disjoint cover of $[0, 1]$:


$$\mathscr{I} = \{[0, p_1), [p_1, p_2), \ldots, [p_n, 1]\}.$$


Here, we remove this restriction and enrich the expressiveness of $\mathscr{I}$. A crucial
fact about this symbolization is that the set $S \times \mathscr{I}$ is finite. Consequently,
the (probability distribution) execution path generated by an initial probability
distribution $\mu$ induces a sequence of symbols in $S \times \mathscr{I}$ over time. Therefore, the
dynamics of CTMCs can be studied in terms of a (real-time) language over the
alphabet $S \times \mathscr{I}$, which is the set of atomic propositions of CLL.

Unlike the non-probabilistic linear-time temporal logics, namely linear-time
temporal logic (LTL) and MTL, CLL has two types of formulas: state
formulas and path formulas. The state formulas are constructed using propositional
connectives. The path formulas are obtained by propositional connectives and a
temporal modal operator timed until $U^{\mathcal{T}}$ for a bounded time interval $\mathcal{T}$, as in
MTL and CSL. The standard next-step temporal operator in LTL is meaningless
in continuous-time systems since the time domain (real numbers) is uncountable.
As a result, CLL can express the above-mentioned probabilistic property “the
system is at state $s_0$ with high probability ($\ge 0.9$) within 5 time units after
having continuously remained at state $s_1$ with low probability ($\le 0.1$)” in a path
formula:

$$\varphi = (s_1, [0, 0.1]) \, U^{[0,5]} \, (s_0, [0.9, 1]).$$


In this single until formula, there is a time instant $0 \le t \le 5$ at which state $s_1$
with low probability transits to state $s_0$ with high probability. We illustrate
this on the following timeline.


[Timeline: $(s_1, [0, 0.1])$ holds from time 0 up to an instant $t \le 5$, at which $(s_0, [0.9, 1])$ holds.]


Furthermore, CLL allows multiphase timed until formulas. The semantics of the
formulas focuses on relative time intervals, i.e., time can be reset as in timed
automata [5,6], while those of CSL [8] are for absolute time intervals. Consequently,
CLL can express not only relatively but also absolutely temporal properties of
CTMCs.

We illustrate the significant difference between relatively temporal properties
and absolutely temporal properties of CTMCs. For instance, “before the probability
distribution transition $\varphi$ happens in 3 to 7 time units, the system always stays
at state $s_0$ with a high probability ($\ge 0.9$)” can be formalized as the path formula


$$\varphi' = (s_0, [0.9, 1]) \, U^{[3,7]} \, ((s_1, [0, 0.1]) \, U^{[0,5]} \, (s_0, [0.9, 1])).$$


As we can see, there are two time instants, namely $t_1$ and $t_2$, at which distribution
transitions happen. Time is reset to 0 after the first distribution transition happens
and thus $t_2$ is relative to $t_1$. More clearly, we depict this on the following timeline.


[Timeline: $(s_0, [0.9, 1])$ holds from time 0 until $t_1 \le 7$; $(s_1, [0, 0.1])$ then holds until $t_1 + t_2 \le 12$.]
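
As a worked instance of time reset, the absolute time at which the last proposition of the multiphase formula must hold can be bounded by adding the two relative intervals. This is only an illustrative calculation, assuming the outer interval is $[3, 7]$ ("3 to 7 time units") and the inner interval is $[0, 5]$ ("within 5 time units"):

```latex
\begin{align*}
t_1 &\in [3, 7], \qquad t_2 \in [0, 5] \quad \text{(measured from } t_1 \text{, after the reset)},\\
t_1 + t_2 &\in [3, 7] + [0, 5] = [3 + 0,\; 7 + 5] = [3, 12].
\end{align*}
```

So the final proposition is required at the absolute time $t_1 + t_2$, which lies somewhere in $[3, 12]$; this is where the bound 12 comes from.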


An absolute version is “the probability distribution transition $\varphi$ happens and the
system always stays at state $s_0$ with a high probability ($\ge 0.9$) in 3 to 7 time
units”:


$$\varphi'' = \Box^{[3,7]} (s_0, [0.9, 1]) \wedge ((s_1, [0, 0.1]) \, U^{[0,5]} \, (s_0, [0.9, 1])).$$


We can get a clear timeline representation by simply adding $\Box^{[3,7]} (s_0, [0.9, 1])$ to
that of $\varphi$. Assume that $t \le 3$:


[Timeline: $(s_1, [0, 0.1])$ holds from time 0 up to $t \le 3$; $(s_0, [0.9, 1])$ holds throughout $[3, 7]$.]

Time reset enriches the expressiveness of CLL but makes model checking CLL
more difficult than model checking CSL. We overcome this by translating relative time to
absolute time. As a result, we develop an algorithm to model check CTMCs
against CLL formulas. More precisely, we reduce the model-checking problem to 
a reachability problem of absolute time intervals. The reachability problem
corresponds to the real root isolation problem of real polynomial-exponential functions
(PEFs) over the field of algebraic numbers, a question extensively studied in the
recent symbolic and algebraic computation community (e.g. [1,20,28]). By
developing a state-of-the-art real root isolation algorithm, we resolve the latter
problem under the assumption of the validity of Schanuel’s conjecture, a central
open question in transcendental number theory [27]. This conjecture has also
been the cornerstone of the correctness of many recent model-checking algorithms,
including the decidability of continuous-time Markov decision processes [30], the
synthesis of inductive invariants for continuous linear dynamical systems [4], the
termination analysis for probabilistic programs with delays [39], and reachability
analysis for dynamical systems [20].

In summary, the main contributions of this paper are as follows. 


— Introducing a probabilistic logic, namely continuous-time linear logic (CLL), 
for reasoning about CTMCs; 

— Developing a state-of-the-art real root isolation algorithm for PEFs over the 
field of algebraic numbers for checking atomic propositions of CLL; 

— Proving that model checking CTMCs against CLL formulas is decidable 
subject to Schanuel’s conjecture. 


Organization of this paper. In the next section, we give the mathematical
preliminaries used in this paper. In Section 3, we recall the view of CTMCs as
distribution transformers. After that, the symbolic dynamics of CTMCs are
introduced by symbolizing distributions over states of CTMCs in Section 4. In the
subsequent section, we present our continuous-time probabilistic temporal logic
CLL. In Section 6, we develop an algorithm to solve the CLL model checking
problem. A case study and related works are shown in Sections 7 and 8,
respectively. We summarize our results and point out future research directions in the
final section.


2 Preliminaries 


For the convenience of the readers, we review basic definitions and notations of 
number theory, particularly Schanuel’s conjecture. 

Throughout this paper, we write $\mathbb{C}$, $\mathbb{R}$, $\mathbb{Q}$ and $\mathbb{A}$ for the fields of all complex,
real, rational and algebraic numbers, respectively. In addition, $\mathbb{Z}$ denotes the set
of all integers. For $\mathbb{F} \in \{\mathbb{C}, \mathbb{R}, \mathbb{Q}, \mathbb{Z}, \mathbb{A}\}$, we use $\mathbb{F}[t]$ and $\mathbb{F}^{n \times m}$ to denote
the set of polynomials in $t$ with coefficients in $\mathbb{F}$ and $n$-by-$m$ matrices with every
entry in $\mathbb{F}$, respectively. Furthermore, for $\mathbb{F} \in \{\mathbb{R}, \mathbb{Q}, \mathbb{Z}\}$, we use $\mathbb{F}^{+}$ to denote
the set of nonnegative elements (the positive elements together with 0) of $\mathbb{F}$.

A bounded (time) interval $\mathcal{T}$ is a subset of $\mathbb{R}^{+}$, which may be open, half-open
or closed with one of the following forms:


$$[t_1, t_2], \quad [t_1, t_2), \quad (t_1, t_2], \quad (t_1, t_2),$$


where $t_1, t_2 \in \mathbb{R}^{+}$ and $t_2 \ge t_1$ ($t_1 = t_2$ is only allowed in the case of $[t_1, t_2]$). Here,
$t_1$ and $t_2$ are called the left and right endpoints of $\mathcal{T}$, respectively. Conveniently,
we use $\inf \mathcal{T}$ and $\sup \mathcal{T}$ to denote $t_1$ and $t_2$, respectively. In this paper, we only
consider bounded intervals.

For reasoning about the temporal properties, we further define the addition
and subtraction of (time) intervals. The expression $\mathcal{T} + t$ or $t + \mathcal{T}$, for $t \in \mathbb{R}^{+}$,
denotes the interval $\{t' + t : t' \in \mathcal{T}\}$. Similarly, $\mathcal{T} - t$ stands for the interval
$\{t' - t : t' \in \mathcal{T}\}$ if $t \le \inf \mathcal{T}$. Furthermore, for two intervals $\mathcal{T}_1$ and $\mathcal{T}_2$,


$$\mathcal{T}_1 + \mathcal{T}_2 = \{t \in (t' + \mathcal{T}_2) : t' \in \mathcal{T}_1\} = \{t_1 + t_2 : t_1 \in \mathcal{T}_1 \text{ and } t_2 \in \mathcal{T}_2\}.$$


Two intervals $\mathcal{T}_1$ and $\mathcal{T}_2$ are disjoint if their intersection is an empty set, i.e.,
$\mathcal{T}_1 \cap \mathcal{T}_2 = \emptyset$. Let us see some concrete examples: $1 + (2,3) = (3,4)$, $(2,3) - 1 =
(1,2)$, $(2,3) + [3,4] = (5,7)$, and $(2,3)$, $[3,4]$ are disjoint. All of the above
calculations on time intervals are straightforward to compute.
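
The interval operations above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the paper's algorithm: the `Interval` class and its method names are hypothetical, and endpoints are assumed to be exact rationals or integers.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class Interval:
    """Bounded interval with endpoints lo <= hi; each end may be open or closed."""
    lo: Fraction
    hi: Fraction
    lo_closed: bool = True
    hi_closed: bool = True

    def shift(self, t):
        """Return T + t; pass a negative t for subtraction (needs |t| <= inf T)."""
        lo, hi = self.lo + t, self.hi + t
        if lo < 0:
            raise ValueError("shifted interval must stay within the nonnegative reals")
        return Interval(lo, hi, self.lo_closed, self.hi_closed)

    def __add__(self, other):
        """T1 + T2: an endpoint is attained only if both summands attain theirs."""
        return Interval(self.lo + other.lo, self.hi + other.hi,
                        self.lo_closed and other.lo_closed,
                        self.hi_closed and other.hi_closed)

    def is_empty(self):
        if self.lo == self.hi:
            return not (self.lo_closed and self.hi_closed)
        return self.lo > self.hi

    def intersect(self, other):
        # Larger lower bound wins; on a tie the open end wins.
        lo, lo_open = max((self.lo, not self.lo_closed),
                          (other.lo, not other.lo_closed))
        # Smaller upper bound wins; on a tie the open end wins.
        hi, hi_closed = min((self.hi, self.hi_closed),
                            (other.hi, other.hi_closed))
        return Interval(lo, hi, not lo_open, hi_closed)

    def disjoint(self, other):
        return self.intersect(other).is_empty()


open_2_3 = Interval(2, 3, lo_closed=False, hi_closed=False)
closed_3_4 = Interval(3, 4)
print(open_2_3.shift(1), open_2_3 + closed_3_4, open_2_3.disjoint(closed_3_4))
```

The last line replays the concrete examples from the text: shifting $(2,3)$ by 1, adding $(2,3)$ and $[3,4]$, and testing the two for disjointness.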

An algebraic number is a complex number that is a root of a non-zero
polynomial in one variable with rational coefficients (or equivalently, by
eliminating denominators, with integer coefficients). An algebraic number $\alpha$ is represented by
$(P, (a, b), \varepsilon)$ where $P$ is the minimal polynomial of $\alpha$, $a, b \in \mathbb{Q}$ and $a + bi$ is an
approximation of $\alpha$ such that $|\alpha - (a + bi)| < \varepsilon$ and $\alpha$ is the only root of $P$ in the
open ball $B(a + bi, \varepsilon)$. The minimal polynomial of $\alpha$ is the polynomial with the
smallest degree in $\mathbb{Q}[t]$ such that $\alpha$ is a root of the polynomial and the coefficient
of the highest-degree term is 1. Any root of $f(t) \in \mathbb{A}[t]$ is algebraic. Moreover,
given the representations of $a, b \in \mathbb{A}$, the representations of $a + b$, $\frac{a}{b}$ and $a \cdot b$ can
be computed in polynomial time, as can equality checking [17].

Furthermore, a complex number is called transcendental if it is not an
algebraic number. In general, it is challenging to verify relationships between
transcendental numbers [33]. On the other hand, one can use the
Lindemann-Weierstrass theorem to compare some transcendental numbers. The
transcendence of $e$ and $\pi$ is a direct corollary of this theorem.


Theorem 1 (Lindemann-Weierstrass theorem). Let $\eta_1, \cdots, \eta_n$ be pairwise
distinct algebraic complex numbers. Then $\sum_{k} \lambda_k e^{\eta_k} \neq 0$ for non-zero algebraic
numbers $\lambda_1, \cdots, \lambda_n$.
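
To make the corollary concrete, the following short derivation shows how the transcendence of $e$ and $\pi$ follows from the theorem. It is a standard argument spelled out for illustration, not taken from the paper:

```latex
\begin{itemize}
  \item Suppose $e$ were algebraic. Take $\eta_1 = 1$, $\eta_2 = 0$ (distinct and
        algebraic) and $\lambda_1 = 1$, $\lambda_2 = -e$ (non-zero and, by assumption,
        algebraic). Then
        \[ \lambda_1 e^{\eta_1} + \lambda_2 e^{\eta_2} = e - e = 0, \]
        contradicting the theorem. Hence $e$ is transcendental.
  \item Suppose $\pi$ were algebraic. Then $\pi i$ is algebraic as well. Take
        $\eta_1 = \pi i$, $\eta_2 = 0$ and $\lambda_1 = \lambda_2 = 1$. Then
        \[ e^{\pi i} + e^{0} = -1 + 1 = 0, \]
        again contradicting the theorem. Hence $\pi$ is transcendental.
\end{itemize}
```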


The following concepts are introduced to study the general relation between
transcendental numbers. 


Definition 1 (Algebraic independence). A set of complex numbers $S =
\{a_1, \cdots, a_n\}$ is algebraically independent over $\mathbb{Q}$ if the elements of $S$ do not
satisfy any nontrivial (non-constant) polynomial equation with coefficients in $\mathbb{Q}$.


By the above definition, for any transcendental number $u$, $\{u\}$ is algebraically
independent over $\mathbb{Q}$, while $\{a\}$ for any algebraic number $a \in \mathbb{A}$ is not. Thus, a
set of complex numbers that is algebraically independent over $\mathbb{Q}$ must consist of
transcendental numbers. $\{\pi, e^{\pi\sqrt{n}}\}$ is also algebraically independent over $\mathbb{Q}$ for
any positive integer $n$ [31]. Checking algebraic independence is challenging.
For example, it is still widely open whether $\{e, \pi\}$ is algebraically independent
over $\mathbb{Q}$.


Definition 2 (Extension field). Given two fields $E \subseteq F$, $F$ is an extension
field of $E$, denoted by $F/E$, if the operations of $E$ are those of $F$ restricted to
$E$.


For example, under the usual notions of addition and multiplication, the field of 
complex numbers is an extension field of real numbers. 


Definition 3 (Transcendence degree). Let L be an extension field of Q, 
the transcendence degree of L over Q is defined as the largest cardinality of an 
algebraically independent subset of L over Q. 


For instance, let $\mathbb{Q}(e)/\mathbb{Q} = \{a + be \mid a, b \in \mathbb{Q}\}$ and $\mathbb{Q}(\sqrt{2})/\mathbb{Q} = \{a + b\sqrt{2} \mid a, b \in
\mathbb{Q}\}$ be two extension fields of $\mathbb{Q}$. Then their transcendence degrees are 1
and 0, respectively, by noting that $e$ is a transcendental number and $\sqrt{2}$ is an
algebraic number.

Now, Schanuel's conjecture is ready to be presented. 


Conjecture 1 (Schanuel’s conjecture). Given any complex numbers $z_1, \cdots, z_n$
that are linearly independent over $\mathbb{Q}$, the extension field $\mathbb{Q}(z_1, \ldots, z_n, e^{z_1}, \ldots, e^{z_n})$
has transcendence degree of at least $n$ over $\mathbb{Q}$.


Stephen Schanuel proposed this conjecture during a course given by Serge
Lang at Columbia in the 1960s [27]. Schanuel’s conjecture concerns the
transcendence degree of certain field extensions of the rational numbers. The conjecture,
if proven, would generalize the most well-known results in transcendental
number theory significantly [29,37]. For example, the algebraic independence of
$\{e, \pi\}$ would simply follow by setting $z_1 = 1$ and $z_2 = \pi i$, and using Euler’s
identity $e^{\pi i} + 1 = 0$.
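
The step from the conjecture to the algebraic independence of $e$ and $\pi$ can be spelled out as follows. This is a sketch of the standard argument, conditional on Schanuel's conjecture, and is added here only for illustration:

```latex
\begin{enumerate}
  \item $z_1 = 1$ and $z_2 = \pi i$ are linearly independent over $\mathbb{Q}$,
        since $\pi i$ is not a rational number.
  \item By the conjecture,
        $\mathbb{Q}(z_1, z_2, e^{z_1}, e^{z_2}) = \mathbb{Q}(1, \pi i, e, e^{\pi i})$
        has transcendence degree at least $2$ over $\mathbb{Q}$.
  \item Euler's identity gives $e^{\pi i} = -1 \in \mathbb{Q}$, so this field is just
        $\mathbb{Q}(\pi i, e)$. Being generated by two elements, it has transcendence
        degree at most $2$, hence exactly $2$, and $\{\pi i, e\}$ is algebraically
        independent over $\mathbb{Q}$.
  \item Adjoining the algebraic number $i$ does not change the transcendence degree:
        \[
          \operatorname{trdeg}_{\mathbb{Q}} \mathbb{Q}(\pi, e)
          = \operatorname{trdeg}_{\mathbb{Q}} \mathbb{Q}(\pi, e, i)
          = \operatorname{trdeg}_{\mathbb{Q}} \mathbb{Q}(\pi i, e, i)
          = \operatorname{trdeg}_{\mathbb{Q}} \mathbb{Q}(\pi i, e) = 2 ,
        \]
        so $\{e, \pi\}$ is algebraically independent over $\mathbb{Q}$ as well.
\end{enumerate}
```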


3 Continuous-time Markov Chains as Distributions 
Transformers 


We begin with the definition of continuous-time Markov chains (CTMCs). A
CTMC is a Markovian (memoryless) stochastic process that takes values on a
finite state set $S$ ($|S| = d < \infty$) and evolves in continuous time $t \in \mathbb{R}^{+}$. Formally,


Definition 4. A CTMC is a pair $M = (S, Q)$, where $S$ ($|S| = d$) is a finite
state set and $Q \in \mathbb{Q}^{d \times d}$ is a transition rate matrix.


A transition rate matrix $Q$ is a matrix whose off-diagonal entries $\{Q_{i,j}\}_{i \neq j}$ are
nonnegative rational numbers, representing the transition rate from state $i$ to
state $j$, while the diagonal entries $\{Q_{j,j}\}$ are constrained to be $-\sum_{i \neq j} Q_{i,j}$ for
all $1 \le j \le d$. Consequently, the column summations of $Q$ are all zero.
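
The two conditions on a transition rate matrix (nonnegative off-diagonal entries and zero column sums) are easy to check mechanically. The following is a minimal sketch; the function name `is_rate_matrix` and the two small test matrices are hypothetical and only serve as an illustration.

```python
from fractions import Fraction


def is_rate_matrix(Q):
    """Check that Q is square, has nonnegative off-diagonal entries,
    and that every column sums to zero."""
    d = len(Q)
    if any(len(row) != d for row in Q):
        return False
    off_diagonal_ok = all(Q[i][j] >= 0
                          for i in range(d) for j in range(d) if i != j)
    columns_ok = all(sum(Q[i][j] for i in range(d)) == 0 for j in range(d))
    return off_diagonal_ok and columns_ok


# Hypothetical 2-state examples with exact rational entries.
good = [[Fraction(-2), Fraction(1)],
        [Fraction(2), Fraction(-1)]]
bad = [[Fraction(-2), Fraction(1)],
       [Fraction(1), Fraction(-1)]]   # first column sums to -1
print(is_rate_matrix(good), is_rate_matrix(bad))
```

Using `Fraction` keeps the zero-sum test exact, which matches the requirement that the rates be rational.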

The evolution of a CTMC can be regarded as a distribution transformer.
Given initial distribution $\mu \in \mathbb{Q}^{d \times 1} \cap \mathcal{D}(S)$, the distribution at time $t \in \mathbb{R}^{+}$ is:


$$\mu_t = e^{Qt} \mu,$$


where $\mathcal{D}(S)$ denotes the set of all probability distributions over $S$. We call
$\mathcal{D}(S)$ the probability distribution space of CTMCs. An execution path of CTMCs
is a continuous function indexed by initial distribution $\mu \in \mathcal{D}(S)$:


$$\sigma_\mu : \mathbb{R}^{+} \to \mathcal{D}(S), \quad \sigma_\mu(t) = e^{Qt} \mu. \qquad (1)$$


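Example 1. We recall the illustrating example of CTMC $M = (S, Q)$ in [8,
Figure 1] as the running example in our work. In particular, $M$ is a 5-dimensional
CTMC with initial distribution $\mu$, where $S = \{s_0, s_1, s_2, s_3, s_4\}$ and


$$
Q = \begin{pmatrix}
-3 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 0 \\
2 & 0 & -7 & 0 & 0 \\
0 & 0 & 3 & 0 & 0 \\
0 & 0 & 4 & 0 & 0
\end{pmatrix},
\qquad
\mu = \begin{pmatrix} 0.1 \\ 0.2 \\ 0.3 \\ 0.4 \\ 0 \end{pmatrix}.
$$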
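
To see the distribution transformer at work, the sketch below evaluates $\mu_t = e^{Qt}\mu$ numerically for the running example. It is a floating-point illustration only, not the exact symbolic method used in the paper: the helper names (`mat_mul`, `expm`, `distribution_at`) are hypothetical, the matrix exponential is approximated by scaling and squaring with a truncated Taylor series, and the entries of $Q$ and $\mu$ are assumed to be as read from Example 1.

```python
import math


def mat_mul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def expm(A, terms=30):
    """Approximate exp(A) by scaling and squaring with a truncated Taylor series."""
    n = len(A)
    norm = max(sum(abs(x) for x in row) for row in A)
    k = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scaled = [[x / 2 ** k for x in row] for row in A]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        # term becomes scaled^m / m!
        term = [[x / m for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tr)]
                  for rr, tr in zip(result, term)]
    for _ in range(k):
        result = mat_mul(result, result)   # undo the scaling
    return result


def distribution_at(Q, mu, t):
    """Return mu_t = exp(Q t) mu as a list of floats."""
    E = expm([[q * t for q in row] for row in Q])
    return [sum(e * m for e, m in zip(row, mu)) for row in E]


# Values assumed from Example 1 (columns of Q sum to zero).
Q = [[-3, 0, 0, 0, 0],
     [1, 0, 0, 0, 0],
     [2, 0, -7, 0, 0],
     [0, 0, 3, 0, 0],
     [0, 0, 4, 0, 0]]
mu = [0.1, 0.2, 0.3, 0.4, 0.0]

mu_1 = distribution_at(Q, mu, 1.0)
print(mu_1, sum(mu_1))
```

Because the columns of $Q$ sum to zero, the entries of $\mu_t$ keep summing to 1 for every $t$, which is a quick sanity check on any implementation.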


4 Symbolic Dynamics of CTMCs 


In this section, we introduce symbolic dynamics to characterize the properties 
of the probability distribution space of CTMCs. 

First, we fix a finite set of intervals $\mathscr{I} = \{\mathcal{I}_k \subseteq [0,1]\}_{k=1}^{m}$, where the
endpoints of each $\mathcal{I}_k$ are rational numbers. With the states $S = \{s_0, s_1, \cdots, s_{d-1}\}$,
we define the symbolization of distributions as a function:


$$\mathbb{S} : \mathcal{D}(S) \to 2^{S \times \mathscr{I}}, \quad \mathbb{S}(\mu) = \{(s, \mathcal{I}) \in S \times \mathscr{I} : \mu(s) \in \mathcal{I}\}, \qquad (2)$$


where $\times$ denotes the Cartesian product, and $2^{S \times \mathscr{I}}$ is the power set of $S \times
\mathscr{I}$. $(s, \mathcal{I}) \in \mathbb{S}(\mu)$ asserts that the probability of state $s$ in distribution $\mu$ is
in the interval $\mathcal{I}$. The symbolization of distributions is a generalization of the
discretization of distributions with $\mathcal{I}_k \cap \mathcal{I}_m = \emptyset$ for all $k \neq m$, which was studied in
[2]. This generalization increases the expressiveness of our continuous linear-time
logic introduced in the next section. Now, we can represent any given probability
distribution by finite symbols from $S \times \mathscr{I}$. For example, suppose


$$\mathscr{I} = \{[0, 0.1], (0.1, 0.9), [0.9, 1], [1, 1], [0.4, 0.4]\}, \qquad (3)$$
and then the initial distribution in Example 1 is symbolized as


$$\mathbb{S}(\mu) = \{(s_0, [0, 0.1]), (s_1, (0.1, 0.9)), (s_2, (0.1, 0.9)), (s_3, (0.1, 0.9)), (s_3, [0.4, 0.4]), (s_4, [0, 0.1])\}.$$

As we can see from the above example, the symbolization of distributions on
states considers both the exact probabilities (singleton intervals) of the states and the
ranges in which these probabilities fall.
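
The symbolization function can be sketched directly from its definition. In the minimal Python illustration below, the tuple encoding of intervals and the names `contains` and `symbolize` are assumptions made for the example; the interval set and the distribution are those of Eq. (3) and Example 1, written as exact fractions.

```python
from fractions import Fraction as F


# An interval is encoded as (lo, hi, lo_closed, hi_closed) with rational endpoints.
def contains(interval, p):
    lo, hi, lo_closed, hi_closed = interval
    above = p >= lo if lo_closed else p > lo
    below = p <= hi if hi_closed else p < hi
    return above and below


def symbolize(mu, intervals):
    """Return the set of symbols (s, I) such that mu(s) lies in I."""
    return {(s, I) for s, p in mu.items() for I in intervals if contains(I, p)}


LOW = (F(0), F(1, 10), True, True)          # [0, 0.1]
MID = (F(1, 10), F(9, 10), False, False)    # (0.1, 0.9)
HIGH = (F(9, 10), F(1), True, True)         # [0.9, 1]
ONE = (F(1), F(1), True, True)              # [1, 1]
POINT_04 = (F(2, 5), F(2, 5), True, True)   # [0.4, 0.4]
INTERVALS = [LOW, MID, HIGH, ONE, POINT_04]

MU = {"s0": F(1, 10), "s1": F(1, 5), "s2": F(3, 10), "s3": F(2, 5), "s4": F(0)}

symbols = symbolize(MU, INTERVALS)
for state, interval in sorted(symbols):
    print(state, interval)
```

Since the intervals may overlap, one state can contribute several symbols: here $s_3$ is matched both by $(0.1, 0.9)$ and by the singleton $[0.4, 0.4]$.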

Next, we introduce the symbolization to CTMCs.


Definition 5. A symbolized CTMC is a tuple $SM = (S, Q, \mathscr{I})$, where $M =
(S, Q)$ is a CTMC and $\mathscr{I}$ is a finite set of intervals in $[0,1]$.


As we can see, the set of intervals is picked depending on CTMCs. Then, we
extend this symbolization to the path $\sigma_\mu$:


$$\mathbb{S} \circ \sigma_\mu : \mathbb{R}^{+} \to 2^{S \times \mathscr{I}}. \qquad (5)$$


Definition 6. Given a symbolized CTMC $SM = (S, Q, \mathscr{I})$, $\mathbb{S} \circ \sigma_\mu$ is a symbolic
execution path of $M = (S, Q)$.


Given a symbolized CTMC $SM = (S, Q, \mathscr{I})$, the path $\sigma_\mu$ of CTMC $M = (S, Q)$
over real numbers $\mathbb{R}^{+}$ generated by probability distribution $\mu$ induces a symbolic
execution path $\mathbb{S} \circ \sigma_\mu$ over finite symbols $S \times \mathscr{I}$. Subsequently, the dynamics
of CTMCs can be studied in terms of a language over $S \times \mathscr{I}$. In other words,
we can study the temporal properties of CTMCs in the context of symbolized
CTMCs.


5 Continuous Linear-time Logic 


In this section, we introduce continuous linear-time logic (CLL), a probabilistic
linear-time temporal logic, to specify the temporal properties of a symbolized
CTMC $SM = (S, Q, \mathscr{I})$.

CLL has two types of formulas: state formulas and path formulas. The state
formulas are constructed using propositional connectives. The path formulas are
obtained by propositional connectives and a temporal modal operator timed until
$U^{\mathcal{T}}$ for a bounded time interval $\mathcal{T}$, as in MTL and CSL. Furthermore, multiphase
timed until formulas $\Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n$ are allowed to enrich the
expressiveness of CLL. More importantly, time reset is involved in these multiphase
formulas. Thus absolutely and relatively temporal properties of CTMCs can be
studied.


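Definition 7. The state formulas of CLL are described according to the
following syntax:
$$\Phi := \text{true} \mid a \in AP \mid \neg\Phi \mid \Phi_1 \wedge \Phi_2$$
where $AP$ denotes $S \times \mathscr{I}$ as the set of atomic propositions.
The path formulas of CLL are constructed by the following syntax:


$$\varphi := \text{true} \mid \Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2$$


where $n \in \mathbb{Z}^{+}$ is a positive integer, for all $0 \le k \le n$, $\Phi_k$ is a state formula,
and the $\mathcal{T}_k$'s are time intervals with the endpoints in $\mathbb{Q}^{+}$, i.e., each $\mathcal{T}_k$ is one of the
following forms:

$$(a, b), \; [a, b], \; (a, b], \; [a, b) \quad \forall a, b \in \mathbb{Q}^{+}.$$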


A Probabilistic Logic for Verifying Continuous-time Markov Chains 11 


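The semantics of CLL state formulas is defined on the set $\mathcal{D}(S)$ of probability
distributions over $S$ with the symbolization function $\mathbb{S}$ in Eq. (2) of Section 4.


(1) $\mu \models \text{true}$ for all probability distributions $\mu \in \mathcal{D}(S)$;

(2) $\mu \models a$ iff $a \in \mathbb{S}(\mu)$;

(3) $\mu \models \neg\Phi$ iff it is not the case that $\mu \models \Phi$ (or written $\mu \not\models \Phi$);

(4) $\mu \models \Phi_1 \wedge \Phi_2$ iff $\mu \models \Phi_1$ and $\mu \models \Phi_2$.


The semantics of CLL path formulas is defined on execution paths $\{\sigma_\mu\}_{\mu \in \mathcal{D}(S)}$
of CTMC $M = (S, Q)$.


(1) $\sigma_\mu \models \text{true}$ for all probability distributions $\mu \in \mathcal{D}(S)$;

(2) $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1} \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n$ iff there is a time instant $t \in \mathcal{T}_1$ such that
$\sigma_{\mu_t} \models \Phi_1 U^{\mathcal{T}_2} \Phi_2 \ldots U^{\mathcal{T}_n} \Phi_n$, and for any $t' \in \mathcal{T}_1 \cap [0, t)$, $\mu_{t'} \models \Phi_0$, where
$\sigma_{\mu_t} \models \Phi$ iff $\mu_t \models \Phi$, and $\mu_t$ is the distribution of the chain at time instant $t$,
i.e., $\mu_t = e^{Qt} \mu$ $\forall t \in \mathbb{R}^{+}$;

(3) $\sigma_\mu \models \neg\varphi$ iff it is not the case that $\sigma_\mu \models \varphi$ (written $\sigma_\mu \not\models \varphi$);

(4) $\sigma_\mu \models \varphi_1 \wedge \varphi_2$ iff $\sigma_\mu \models \varphi_1$ and $\sigma_\mu \models \varphi_2$.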
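
The semantics of a single timed until formula $\Phi_0 U^{[a,b]} \Phi_1$ can be approximated by sampling the execution path on a time grid. The sketch below does exactly that and nothing more: it is a numerical approximation, not the exact root-isolation procedure of the paper, and it can miss witnesses that fall between grid points. The names `atom` and `check_until`, the step size, and the two-state chain with path $(e^{-t}, 1 - e^{-t})$ are all hypothetical choices for illustration.

```python
import math


def atom(state, lo, hi):
    """State formula (s, [lo, hi]): the probability of `state` is in [lo, hi]."""
    return lambda mu: lo <= mu[state] <= hi


def check_until(path, phi0, phi1, lo, hi, step=1e-3):
    """Sampled check of  phi0 U^[lo,hi] phi1  along path(t).

    Looks for the first grid point t in [lo, hi] where phi1 holds, requiring
    phi0 at every earlier grid point in [lo, t). Checking the first witness is
    enough: if phi0 fails before it, it also fails before any later witness.
    """
    n = int(round((hi - lo) / step))
    for i in range(n + 1):
        mu_t = path(lo + i * step)
        if phi1(mu_t):
            return True      # witness found, phi0 held on the sampled [lo, t)
        if not phi0(mu_t):
            return False     # phi0 broke before any witness appeared
    return False


# Hypothetical two-state chain with Q = [[-1, 0], [1, 0]] and mu = (1, 0),
# for which mu_t = (exp(-t), 1 - exp(-t)) in closed form.
def path(t):
    return (math.exp(-t), 1.0 - math.exp(-t))


print(check_until(path, atom(0, 0.5, 1.0), atom(1, 0.3, 1.0), 0.0, 5.0))
print(check_until(path, atom(0, 0.9, 1.0), atom(1, 0.9, 1.0), 0.0, 5.0))
```

In the first query state 1 reaches probability 0.3 while state 0 is still above 0.5, so the formula holds; in the second, state 0 drops below 0.9 long before state 1 reaches 0.9, so it does not.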


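Not surprisingly, other Boolean connectives are derived in the standard way,
i.e., $\text{false} = \neg\text{true}$, $\Phi_1 \vee \Phi_2 = \neg(\neg\Phi_1 \wedge \neg\Phi_2)$ and $\Phi_1 \to \Phi_2 = \neg\Phi_1 \vee \Phi_2$, and
the path formula $\varphi$ follows the same way. Furthermore, we generalize temporal
operators $\Diamond$ (“eventually”) and $\Box$ (“always”) of discrete-time systems into their
timed variants $\Diamond^{\mathcal{T}}$ and $\Box^{\mathcal{T}}$, respectively, in the following:


$$\Diamond^{\mathcal{T}} \Phi = \text{true} \, U^{\mathcal{T}} \Phi, \qquad \Box^{\mathcal{T}} \Phi = \neg \Diamond^{\mathcal{T}} \neg\Phi.$$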


For $n = 1$ in multiphase timed until formulas, the until operator $U^{\mathcal{T}_1}$ is a
timed variant of the until operator of LTL; the path formula $\Phi_0 U^{\mathcal{T}_1}\Phi_1$ asserts
that $\Phi_1$ is satisfied at some time instant in the interval $\mathcal{T}_1$ and that at all
preceding time instants in $\mathcal{T}_1$, $\Phi_0$ holds. For example,


y = (sı, [0, 0.1] U 959! (so, (0.9, 1]), 


as mentioned in the introduction.
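The single-phase semantics above can be illustrated numerically. The following is only a sketch under stated assumptions: it samples time on a uniform grid (so it approximates, and does not decide, satisfaction), it treats the time interval as closed, and the two-state chain `dist_at` and the helper names `timed_until` and `atom` are hypothetical, not taken from the paper.

```python
import math

def timed_until(dist_at, phi0, phi1, t_lo, t_hi, step=1e-3):
    """Grid approximation of  phi0 U^[t_lo, t_hi] phi1.

    Scans the interval from left to right: the formula holds at the first
    grid point where phi1 holds, provided phi0 held at all earlier points.
    """
    n = int(round((t_hi - t_lo) / step))
    for k in range(n + 1):
        mu = dist_at(t_lo + k * step)
        if phi1(mu):
            return True      # witness found, all earlier points satisfied phi0
        if not phi0(mu):
            return False     # phi0 broken before any witness for phi1
    return False

def dist_at(t):
    """Hypothetical chain: mass leaves s0 at rate 1 and is absorbed in s1."""
    p0 = math.exp(-t)
    return {"s0": p0, "s1": 1.0 - p0}

def atom(state, lo, hi):
    """Atomic proposition <state, [lo, hi]> as a predicate on distributions."""
    return lambda mu: lo <= mu[state] <= hi

holds = timed_until(dist_at, atom("s0", 0.5, 1.0), atom("s1", 0.5, 1.0), 0.0, 2.0)
fails = timed_until(dist_at, atom("s0", 0.9, 1.0), atom("s1", 0.5, 1.0), 0.0, 2.0)
```

In the first call the probability of `s0` stays at least 0.5 until the probability of `s1` reaches 0.5; in the second, the stricter bound 0.9 on `s0` is violated first.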

For general $n$, the CLL path formula $\Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2 \ldots U^{\mathcal{T}_n}\Phi_n$ is explained
by induction on $n$. We first mention that $U^{\mathcal{T}}$ is right-associative, e.g.,
$\Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2$ stands for $\Phi_0 U^{\mathcal{T}_1}(\Phi_1 U^{\mathcal{T}_2}\Phi_2)$. This makes time reset, i.e., $\mathcal{T}_1$ and
$\mathcal{T}_2$ do not have to be disjoint, and the starting time point of $\mathcal{T}_2$ is based on some
time instant in $\mathcal{T}_1$. Recall the multiphase timed until formula in the introduction;
this formula expresses a relative time property:


g = (so, [0.9, 1) UBT ((s1, [0, 0.1]) U 9 (so, [0.9, 1])), 


which differs from the following CLL path formula representing an absolute
temporal property of CTMCs:


g" = OB: s9, [0.9, 1]) ^ (1, [0, 0.1] U 99) (se, [0.9, 1])). 
As an example, we clarify the semantics of CLL by comparing the above two
path formulas in general forms:


$$\Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2 \quad\text{and}\quad \Phi_0 U^{\mathcal{T}_1}\Phi_1 \wedge \Phi_1 U^{\mathcal{T}_2}\Phi_2.$$


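(1) $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2$ asserts that there are time instants $t_1 \in \mathcal{T}_1$, $t_2 \in \mathcal{T}_2$
such that $\mu_{t_1+t_2} \models \Phi_2$ and for any $t_1' \in \mathcal{T}_1 \cap [0,t_1)$ and $t_2' \in \mathcal{T}_2 \cap [0,t_2)$,
$\mu_{t_1'} \models \Phi_0$ and $\mu_{t_1+t_2'} \models \Phi_1$, where $\mu_t = e^{Qt}\mu$ $\forall t \in \mathbb{R}^+$. This is clearer in
the following timeline.

[Timeline: time axis from 0 marking $\inf\mathcal{T}_1$, $t_1 \le \sup\mathcal{T}_1$, $\inf\mathcal{T}_2$ (measured from $t_1$), and $t_1 + t_2 \le \sup(\mathcal{T}_1 + \mathcal{T}_2)$, with $\Phi_0$, $\Phi_1$, $\Phi_2$ holding on the successive phases.]

(2) $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1}\Phi_1 \wedge \Phi_1 U^{\mathcal{T}_2}\Phi_2$ asserts that there are time instants $t_1 \in \mathcal{T}_1$, $t_2 \in \mathcal{T}_2$
such that $\mu_{t_1} \models \Phi_1$ and $\mu_{t_2} \models \Phi_2$, and for any $t_1' \in \mathcal{T}_1 \cap [0,t_1)$ and
$t_2' \in \mathcal{T}_2 \cap [0,t_2)$, $\mu_{t_1'} \models \Phi_0$ and $\mu_{t_2'} \models \Phi_1$, where $\mu_t = e^{Qt}\mu$ $\forall t \in \mathbb{R}^+$.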


Before solving the model-checking problem of CTMCs against CLL formulas 
in the next section, we shall first discuss what can be specified in our logic CLL. 

Given a CTMC $(S, Q)$, the CLL path formula $\lozenge^{[0,1000]}\langle s, [1,1]\rangle$ expresses a liveness
property that state $s \in S$ is eventually reached with probability one before
time instant 1000. In terms of safety properties, the formula $\square^{[100,1000]}\langle s, [0,0]\rangle$
represents that state $s \in S$ is never reached (reached with probability zero) between
time instants 100 and 1000. Furthermore, by setting the intervals nontrivial (neither
$[0,0]$ nor $[1,1]$), liveness and safety properties can be asserted with probabilities,
such as $\lozenge^{[0,1000]}\langle s, [0.5,1]\rangle$ and $\square^{[100,1000]}\langle s, [0,0.5]\rangle$. The multiphase timed until
formula $\langle s,[0.7,1]\rangle U^{[2,3]}\langle s,[0.7,1]\rangle \ldots U^{[2,3]}\langle s,[0.7,1]\rangle$, where the number of
$U^{[2,3]}$ is 100, asserts that the probability of state $s$ is at least 0.7 in every time
instant 2 to 3, and this happens at least 100 times.

Next, we can classify members of $\mathbb{I}$ as representing “low” and “high”
probabilities. For example, if $\mathbb{I}$ contains 3 intervals $\{[0,0.1], (0.1,0.9), [0.9,1]\}$, we
can declare the first interval as “low” and the last interval as “high”. In this
case $\square^{[10,1000)}(\langle s_0,[0,0.1]\rangle \to \langle s_1,[0.9,1]\rangle)$ says that, in time interval $[10,1000)$,
whenever the probability of state $s_0$ is low, the probability of state $s_1$ will be
high.


6 CLL Model Checking 


In this section, we provide an algorithm to model check CTMCs against CLL
formulas, i.e., we show that the following CLL model-checking problem
(Problem 1) is decidable.


Problem 1 (CLL Model-checking Problem). Given a symbolized CTMC $SM =
(S, Q, \mathbb{I})$ with an initial distribution $\mu$ and a CLL path formula $\varphi$ on $AP =
S \times \mathbb{I}$, the goal is to decide whether $\sigma_\mu \models \varphi$, where $\sigma_\mu(t) = e^{Qt}\mu$ is an execution
path defined in Eq.(1).
In particular, we show that


Theorem 2. Under the condition that Schanuel’s conjecture holds, the CLL 
model-checking problem in Problem 1 is decidable. 


In the following, we prove the above theorem, proceeding from checking the basic
formulas (atomic propositions) to the most complex ones (nontrivial multiphase
timed until formulas). For readability, we put the proofs of all results in Appendix A of
the extended version [21] of this paper.

We start with the simplest case of an atomic proposition $\langle s, \mathcal{I}\rangle$. By the semantics
of CLL, $\mu_t \models \langle s, \mathcal{I}\rangle$ if and only if $\mu_t(s) = e^{Qt}\mu(s) \in \mathcal{I}$. To check this, we first observe
that the execution path $e^{Qt}\mu$ of CTMCs is a system of polynomial exponential
functions (PEFs).


Definition 8. A function $f : \mathbb{R} \to \mathbb{R}$ is a polynomial-exponential function
(PEF) if $f$ has the following form:


$$f(t) = \sum_{k=0}^{K} f_k(t)\, e^{\lambda_k t} \qquad (6)$$


where for all $0 \le k \le K < \infty$, $f_k(t) \in F_1[t]$, $f_k(t) \neq 0$, $\lambda_k \in F_2$, and $F_1, F_2$ are
fields. Without loss of generality, we assume that the $\lambda_k$'s are distinct.
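To make the shape of Eq.(6) concrete, here is a minimal sketch that evaluates a PEF numerically. The representation (a list of coefficient-list and exponent pairs) and the name `eval_pef` are illustrative assumptions; floating-point evaluation is of course not the exact arithmetic over fields that the definition refers to.

```python
import cmath
import math

def eval_pef(terms, t):
    """Evaluate sum_k f_k(t) * exp(lambda_k * t).

    terms: list of (coeffs, lam) pairs, where coeffs = [c0, c1, ...] encodes
    the polynomial f_k(t) = c0 + c1*t + ... and lam is the exponent lambda_k
    (possibly complex).
    """
    total = 0j
    for coeffs, lam in terms:
        poly = sum(c * t**i for i, c in enumerate(coeffs))
        total += poly * cmath.exp(lam * t)
    return total

# Hypothetical PEF: f(t) = (1 + 2t) e^{-3t} + 5
f = [([1, 2], -3), ([5], 0)]

# A conjugate pair of exponents yields a real-valued PEF: e^{it} + e^{-it} = 2 cos t
g = [([1], 1j), ([1], -1j)]
```

The second example mirrors the remark that a PEF with complex range plus its conjugate is a real-valued PEF.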


Generally, for a PEF $f(t)$ with the range in complex numbers $\mathbb{C}$, $g(t) =
f(t) + f^*(t)$ is a PEF with the range in real numbers $\mathbb{R}$, where $f^*(t)$ is the
complex conjugate of $f(t)$. The argument $t$ is omitted whenever convenient, i.e.,
$f = f(t)$. $t$ is called a root of a function $f$ if $f(t) = 0$. PEFs often appear in
transcendental number theory as auxiliary functions in the proofs involving the
exponential function [10].


Lemma 1. Given a CTMC $M = (S, Q)$ with $S = \{s_0, \ldots, s_{d-1}\}$, $Q \in \mathbb{Q}^{d \times d}$,
and an initial distribution $\mu \in \mathbb{Q}^{d \times 1}$, for any $0 \le i \le d-1$, $e^{Qt}\mu(s_i)$, the $i$-th
entry of $e^{Qt}\mu$, can be expressed as a PEF $f : \mathbb{R}^+ \to [0,1]$ as in Eq.(6) with
$F_1 = F_2 = \mathbb{A}$.
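As an illustration of the lemma, the sketch below approximates $e^{Qt}\mu$ by a truncated power series and compares it with the closed-form PEFs of a small chain. The two-state generator `Q` is a hypothetical example (not the chain of Example 1), and it assumes the column convention suggested by $\mu_t = e^{Qt}\mu$ with $\mu$ a column vector, i.e., the columns of `Q` sum to zero.

```python
import math

def mat_vec(A, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def expm_times(Q, mu, t, terms=60):
    """Approximate e^{Qt} mu by the truncated series sum_k (Qt)^k mu / k!."""
    result = list(mu)
    term = list(mu)
    for k in range(1, terms):
        term = [t * x / k for x in mat_vec(Q, term)]   # (Qt)^k mu / k!
        result = [r + x for r, x in zip(result, term)]
    return result

# Hypothetical chain: s0 moves to the absorbing state s1 at rate 3.
Q = [[-3.0, 0.0],
     [3.0, 0.0]]
mu = [0.9, 0.1]
t = 0.5

numeric = expm_times(Q, mu, t)
# Closed-form PEFs for this chain: 0.9 e^{-3t} and 1 - 0.9 e^{-3t}
closed_form = [0.9 * math.exp(-3 * t), 1.0 - 0.9 * math.exp(-3 * t)]
```

Each entry of `closed_form` has the form of Eq.(6), with exponents $-3$ and $0$ being the eigenvalues of `Q`.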


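By the above lemma, for a given $t$ in some bounded time interval $\mathcal{T}$ (to be specified
later), whether $e^{Qt}\mu(s) \in \mathcal{I}$ is determined by the algebraic structure
of the PEF $g(t) = e^{Qt}\mu(s)$ in $\mathcal{T}$, that is, by all maximum intervals $\mathcal{T}_{\max} \subseteq \mathcal{T}$ such
that $g(t) \in \mathcal{I}$ for all $t \in \mathcal{T}_{\max}$. Here an interval $\mathcal{T}_{\max} \neq \emptyset$ is called maximum for
$g(t) \in \mathcal{I}$ if there is no interval $\mathcal{T}'$ with $\mathcal{T}_{\max} \subsetneq \mathcal{T}' \subseteq \mathcal{T}$ on which the property holds,
i.e., $g(t) \in \mathcal{I}$ for all $t \in \mathcal{T}'$. Then $e^{Qt}\mu(s) \in \mathcal{I}$ if and only if $t \in \mathcal{T}_{\max}$ for some maximum
interval $\mathcal{T}_{\max}$. So, we aim to compute the set $\mathbb{T}$ of all maximum intervals. By
the continuity of the PEF $g(t)$, this can be done by identifying a real root isolation
of the following PEF $f(t)$ in $\mathcal{T}$: $f(t) = (g(t) - \inf\mathcal{I})(g(t) - \sup\mathcal{I})$.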

A (real) root isolation of function $f(t)$ in interval $\mathcal{T}$ is a set of mutually
disjoint intervals, denoted by $\mathrm{Iso}(f)_{\mathcal{T}} = \{(a_j, b_j) \subseteq \mathcal{T}\}$ for $a_j, b_j \in \mathbb{Q}$ such that


— for any $j$, there is one and only one root of $f(t)$ in $(a_j, b_j)$;
— for any root $t^*$ of $f(t)$, $t^* \in (a_j, b_j)$ for some $j$.


Furthermore, if $f$ has no root in $\mathcal{T}$, then $\mathrm{Iso}(f)_{\mathcal{T}} = \emptyset$.

Although there are infinitely many real root isolations of $f(t)$ in $\mathcal{T}$, the
number of isolation intervals equals the number of distinct roots of $f(t)$ in $\mathcal{T}$.
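The notion of a root isolation with rational endpoints can be illustrated by the following heuristic sketch. It is not the algorithm developed in this paper: it brackets sign changes on a uniform grid and refines them by bisection in floating point, so it can miss roots of even multiplicity, roots that fall exactly on a grid point, or several roots inside one grid cell. The function `isolate_roots`, the example $g(t) = e^{-t}$, and the interval $[0.2, 0.5]$ are assumptions chosen for illustration.

```python
import math
from fractions import Fraction

def isolate_roots(f, lo, hi, grid=1000, width=Fraction(1, 1000)):
    """Return rational brackets (a, b), each containing a sign change of f."""
    step = Fraction(hi - lo) / grid
    brackets = []
    a = Fraction(lo)
    fa = f(float(a))
    for k in range(1, grid + 1):
        b = Fraction(lo) + k * step
        fb = f(float(b))
        if fa * fb < 0:                      # sign change inside (a, b)
            l, r, fl = a, b, fa
            while r - l > width:             # bisection with exact rational endpoints
                m = (l + r) / 2
                fm = f(float(m))
                if fl * fm <= 0:
                    r = m
                else:
                    l, fl = m, fm
            brackets.append((l, r))
        a, fa = b, fb
    return brackets

def g(t):
    return math.exp(-t)

def f(t):
    # f(t) = (g(t) - inf I)(g(t) - sup I) for the hypothetical interval I = [0.2, 0.5]
    return (g(t) - 0.2) * (g(t) - 0.5)

iso = isolate_roots(f, 0, 3)
```

Here the two isolated roots are $\ln 2$ and $\ln 5$, and since $g$ is continuous and decreasing, the only maximum interval on which $g(t) \in [0.2, 0.5]$ within $[0, 3]$ is $[\ln 2, \ln 5]$.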

Finding real root isolations of PEFs is a long-standing problem that can be
traced back at least to Ritt's paper [34] in 1929. Further results have been
obtained since the last century (e.g. [7,38]). This problem is essential in the
reachability analysis of dynamical systems, an active field of symbolic and
algebraic computation. For the case of $F_1 = \mathbb{Q}$ and $F_2 = \mathbb{N}^+$, an algorithm
named ISOL was proposed in [1] to isolate all real roots of $f(t)$. Later, this algorithm
was extended to the case of $F_1 = \mathbb{Q}$ and $F_2 = \mathbb{R}$ [20]. A variant of the
problem has also been studied in [28]. The correctness of these algorithms is
based on Schanuel's conjecture. Other works use Schanuel's conjecture to
do the root isolation of other functions, such as exp-log functions [35] and tame
elementary functions [36].

By Lemma 1, we pursue this problem in the context of CTMCs. The distinct
feature of solving real root isolations of PEFs in our paper is to deal with complex
numbers $\mathbb{C}$, more specifically algebraic numbers $\mathbb{A}$, i.e., $F_1 = F_2 = \mathbb{A}$. In
contrast, to the best of our knowledge, all the previous works can only handle
the case over $\mathbb{R}$. Here, we develop a state-of-the-art real root isolation algorithm
for PEFs over algebraic numbers. Thus from now on, we always assume that
PEFs are over $\mathbb{A}$, i.e., $F_1 = F_2 = \mathbb{A}$ in Eq.(6). In this case, it is worth noting
that whether a PEF has a root in a given interval $\mathcal{T} \subseteq \mathbb{R}^+$ is decidable subject
to Schanuel’s Conjecture if $\mathcal{T}$ is bounded [16], which is the situation we
consider in this paper.


Theorem 3 ([16]). Under the condition that Schanuel’s conjecture holds, there
is an algorithm to check whether a PEF $f(t)$ has a root in interval $\mathcal{T}$, i.e.,
whether $\mathrm{Iso}(f)_{\mathcal{T}} = \emptyset$.


In this paper, we extend the above checking of $\mathrm{Iso}(f)_{\mathcal{T}} = \emptyset$ to computing
$\mathrm{Iso}(f)_{\mathcal{T}}$ of PEF $f(t)$.


Theorem 4. Under the condition that Schanuel’s conjecture holds, there is an
algorithm to find a real root isolation $\mathrm{Iso}(f)_{\mathcal{T}}$ for any PEF $f(t)$ and interval $\mathcal{T}$.
Furthermore, the number of real roots is finite, i.e., $|\mathrm{Iso}(f)_{\mathcal{T}}| < \infty$.


We can compute the set $\mathbb{T}$ of all maximum intervals with the above theorem
to check atomic propositions. Furthermore, we can compare the values of any
real roots of PEFs, which is important in model checking general multiphase
timed until formulas at the end of this section.


Lemma 2. Let $f_1(t)$ and $f_2(t)$ be two PEFs with the domains in $\mathcal{T}_1$ and $\mathcal{T}_2$,
and let $t_1 \in \mathcal{T}_1$ and $t_2 \in \mathcal{T}_2$ be roots of them, respectively. Under the condition that
Schanuel's conjecture holds, there is an efficient way to check whether or not
$t_1 - t_2 < g$ for any given rational number $g \in \mathbb{Q}$.
For model checking a general state formula $\Phi$, we can also use real root isolation
of some PEF to obtain the set of all maximum intervals $\mathcal{T}_{\max}$ such that $\mu_t \models \Phi$
for all $t \in \mathcal{T}_{\max}$. The reason is that $\Phi$ admits a conjunctive normal form consisting
of atomic propositions. See the proof of the following lemma in Appendix A of
the extended version [21] of this paper for the details.


Lemma 3. Under the condition that Schanuel’s conjecture holds, given a time
interval $\mathcal{T}$, the set $\mathbb{T}$ of all maximum intervals in $\mathcal{T}$ satisfying $\mu_t \models \Phi$ can be
computed, where $\Phi$ is a state formula of CLL. Furthermore, the number of all
intervals in $\mathbb{T}$ is finite; the left and right endpoints of each interval in $\mathbb{T}$ are
roots of PEFs.


At last, we characterize the multiphase timed until formulas by the reacha- 
bility analysis of time intervals (instants). 


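Lemma 4. $\sigma_\mu \models \Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2 \cdots U^{\mathcal{T}_n}\Phi_n$ if and only if there exist time
intervals $\{\mathcal{I}_k \subseteq \mathbb{R}^+\}_{k=0}^{n}$ with $\mathcal{I}_0 = [0,0]$ such that


— The satisfaction of intervals: for all $1 \le k \le n$, $\mu_t \models \Phi_{k-1}$ for all $t \in \mathcal{I}_k$,
and $\mu_{t^*} \models \Phi_n$, where $t^* = \sup\mathcal{I}_n$ and $\mu_t = e^{Qt}\mu$ $\forall t \in \mathbb{R}^+$;

— The order of intervals: for all $1 \le k \le n$, $\mathcal{I}_k \subseteq \mathcal{I}_{k-1} + \mathcal{T}_k$ and $\inf\mathcal{I}_k =
\sup\mathcal{I}_{k-1} + \inf\mathcal{T}_k$.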


By the above lemma, the problem of checking multiphase timed until formulas
reduces to verifying the existence of a sequence of time intervals.
Now we can show the proof of Theorem 2.


Proof. Recall that the nontrivial step is to model check the multiphase timed until
formula $\Phi_0 U^{\mathcal{T}_1}\Phi_1 U^{\mathcal{T}_2}\Phi_2 \ldots U^{\mathcal{T}_n}\Phi_n$, where $\{\mathcal{T}_j\}_{j=1}^{n}$ is a set of bounded rational
intervals in $\mathbb{R}^+$, and for $0 \le k < n+1$, $\Phi_k$ is a state formula.

By Lemma 4, for model checking the above formula, we only need to check
the existence of time intervals $\{\mathcal{I}_k\}_{k=0}^{n}$ described in the lemma. The following
procedure can construct such a set of intervals if it exists:


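— (1) Let $\mathbb{I}_0 = \{\mathcal{I}_0 = [0,0]\}$;

— (2) For each $1 \le k \le n$, obtain the set $\mathbb{I}_k$ in $[0, \sum_{j=1}^{k}\sup\mathcal{T}_j]$ of all
maximum intervals such that $\mu_t \models \Phi_{k-1}$ for all $t \in \mathcal{I}$ with $\mathcal{I} \in \mathbb{I}_k$, where
$\mu_t = e^{Qt}\mu$; this can be done by Lemma 3. Note that $\mathbb{I}_k$ can be the empty
set, i.e., $\mathbb{I}_k = \emptyset$;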

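— (3) Let $k$ range from 1 to $n$. First, updating $\mathbb{I}_k$:


$$\mathbb{I}_k = \{\mathcal{I} \cap (\mathcal{I}' + \mathcal{T}_k) : \mathcal{I} \in \mathbb{I}_k \text{ and } \mathcal{I}' \in \mathbb{I}_{k-1}\}. \qquad (7)$$


The above updates can be finished by Lemma 2. If $\mathbb{I}_k = \emptyset$, then the formula
is not satisfied;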

— (4) Updating $\mathbb{I}_n$: for each $\mathcal{I} \in \mathbb{I}_n$, we replace $\mathcal{I}$ with $[s - \varepsilon, s)$ for some
constant $\varepsilon > 0$ if there is an $s \in \mathcal{I}$ with $s - \varepsilon \in \mathcal{I}$ such that $\mu_s \models \Phi_n$, where
$\mu_s = e^{Qs}\mu$; otherwise, remove this element from $\mathbb{I}_n$. Again, this can be
done by Lemma 3. If $\mathbb{I}_n = \emptyset$, then the formula is not satisfied;

— (5) Finally, let $k$ range from $n - 1$ to 1, updating $\mathbb{I}_k$:
Ip = {[s — inf Tk, s — inf T4] : [s — €, 5) € -A4])- 


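Thus after the above procedure, we have non-empty sets $\{\mathbb{I}_k\}_{k=0}^{n}$ with the
following properties.


— for each $1 \le k \le n$, $\mu_t \models \Phi_{k-1}$ for all $t \in \mathcal{I}_k$ and $\mathcal{I}_k \in \mathbb{I}_k$, and $\mu_{t^*} \models \Phi_n$,
where $t^* = \sup\mathcal{I}_n$;

— for each $1 \le k \le n$ and $\mathcal{I} \in \mathbb{I}_k$, there exists at least one $\mathcal{I}' \in \mathbb{I}_{k-1}$ such that
$\mathcal{I} \subseteq \sup\mathcal{I}' + \mathcal{T}_k$ and $\inf\mathcal{I} = \sup\mathcal{I}' + \inf\mathcal{T}_k$.


Therefore, we can get a set of intervals $\{\mathcal{I}_k\}_{k=0}^{n}$ satisfying the two conditions
in Lemma 4 if it exists. On the other hand, it is easy to check that all such
$\{\mathcal{I}_k\}_{k=0}^{n}$ must be in $\{\mathbb{I}_k\}_{k=0}^{n}$, i.e., for each $k$, $\mathcal{I}_k \subseteq \mathcal{I}$ for some $\mathcal{I} \in \mathbb{I}_k$. This
ensures the correctness of the above procedure.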


By the above constructive analysis, we give an algorithm for model checking
CTMCs against CLL formulas. Since we focus on the decidability problem, we do
not provide the pseudocode of the algorithm. Instead, we present a
numerical experiment to illustrate the checking procedure in the next section.


7 Numerical Implementation 


In this section, we implement a case study of checking CTMCs against CLL
formulas. Here, we consider a symbolized CTMC $SM = (S, Q, \mathbb{I})$, where $M =
(S, Q)$ is the CTMC in Example 1 and the finite set $\mathbb{I}$ is the one considered in
Eq.(3). We check the properties of $M$ given by the following two CLL path
formulas mentioned in the introduction for different initial distributions.

p= (81, (0, 0.1) u 165] (so, [0.9, 1]). 


y! = (so, [0.9, 1) U 71 (s,, [0, 0.1]) U 95] (so, [0.9, 1]). 


By Jordan decomposition, we have Q = SJ.S-! where 


0 —6000 -7 0 000 4 0-400 
0 2001 0 -3000 -i0 0 00 
S= | -7-3000 J-|0 0000 S'-|$0j01 
3 3010 0 0 000 2 0 210 
4 4 100 0 0 000 i10 00 


Next, we take the initial distribution $\mu$ to be the same as the one in Example 1.
The value of $e^{Qt}\mu$ is then as follows:


e 3 0 0 00 0.1 we 
—i(g-9—1) 1 0 00] | 0.2 -et 4 39 
l(e?*t—-e7) 0 e" 600] 103] = co ae 

xe je 9t + i 0 že- 3 10 0.4 ae 28 —'"Tt + 3 
2 5—Tt 2,—3t 8 4 ,—'Tt 4 1 ,—3t —Tt 22 
7€ 3€ Far 0 —Fe 701 0 15* 7e + 105 
As we only consider states $s_0$ and $s_1$ in formulas $\varphi$ and $\varphi'$, we focus on the
following PEFs: $f_0(t) = \frac{1}{10}e^{-3t}$ and $f_1(t) = -\frac{1}{30}e^{-3t} + \frac{7}{30}$.

Next, we initialize the model checking procedures introduced in the proof of
Theorem 2. First, we compute the set $\mathbb{T}$ of all maximum intervals $\mathcal{T} \subseteq [0,5]$
such that $e^{Qt}\mu \models \langle s_0, [0.9,1]\rangle$ for $t \in \mathcal{T}$, i.e., $f_0(t) \in [0.9,1]$ for $t \in \mathcal{T}$. We obtain
$\mathbb{T} = \emptyset$ by the real root isolation algorithm mentioned in Theorem 4, and this
indicates that $\sigma_\mu \not\models \varphi$, where $\sigma_\mu(t) = e^{Qt}\mu$ is the path induced by $\mu$ and defined
in Eq.(1).

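To check whether $\sigma_\mu \models \varphi'$, we compute the set $\mathbb{T}$ of all maximum intervals
$\mathcal{T} \subseteq [0,12]$ such that $e^{Qt}\mu \models \langle s_0, [0.9,1]\rangle$ for $t \in \mathcal{T}$, i.e., $f_0(t) \in [0.9,1]$ for $t \in \mathcal{T}$.
Again, we obtain $\mathbb{T} = \emptyset$ by the real root isolation algorithm in Theorem 4.
Therefore, $\sigma_\mu \not\models \varphi'$.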

In the following, we consider a different initial distribution $\mu_1$ as follows:


0.9 me 
3 (5—3t 
ety, =e%}o1] = ge _ Ten 
0 9^—3t , 35—7t 3 
20,° 20 © F 10 
0 36-3t 4 16-7t 4 
5 F5 F5 


The key PEFs are: $g_0(t) = \frac{9}{10}e^{-3t}$ and $g_1(t) = -\frac{3}{10}(e^{-3t} - 1)$.
Again, we initialize the model checking procedures introduced in the proof of
Theorem 2. We first compute the set $\mathbb{T}$ of all maximum intervals $\mathcal{T} \subseteq [0,5]$ such
that $e^{Qt}\mu_1 \models \langle s_1, [0,0.1]\rangle$ for $t \in \mathcal{T}$, i.e., $g_1(t) \in [0,0.1]$ for $t \in \mathcal{T}$. This can be
done by finding a real root isolation of the following PEF: $g_1^*(t) = -\frac{3}{10}(e^{-3t} -
1) - \frac{1}{10}$.

By implementing the real root isolation algorithm in Theorem 4, we have
$\mathrm{Iso}(g_1^*)_{[0,5]} = \{(0.13, 0.14)\}$ and then $\mathbb{T} = \{[0, t^*]\}$ for $t^* \in (0.13, 0.14)$.
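As a sanity check on the isolating interval, and assuming the PEF reads $g_1^*(t) = -\frac{3}{10}(e^{-3t} - 1) - \frac{1}{10}$, the root can be worked out in closed form for this particular example (in general no such closed form is available, which is why root isolation is needed):

```latex
\begin{align*}
-\tfrac{3}{10}\left(e^{-3t} - 1\right) - \tfrac{1}{10} = 0
  &\iff e^{-3t} - 1 = -\tfrac{1}{3} \\
  &\iff e^{-3t} = \tfrac{2}{3} \\
  &\iff t^* = \tfrac{1}{3}\ln\tfrac{3}{2} \approx 0.1352 .
\end{align*}
```

This value lies in $(0.13, 0.14)$, and since $g_1(t) = -\frac{3}{10}(e^{-3t} - 1)$ increases from $g_1(0) = 0$, the condition $g_1(t) \in [0, 0.1]$ holds exactly on $[0, t^*]$.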


Following the same way, we compute $\mathbb{T}$ for $e^{Qt}\mu_1 \models \langle s_0, [0.9,1]\rangle$. Then we
complete the model checking procedures in the proof of Theorem 2, and we
conclude: o, = v. By repeating these, the result of the second formula y’ is 


yu; a g. 


8 Related Works 


Agrawal et al. [2] introduced probabilistic linear-time temporal logic (PLTL) to 
reason about discrete-time Markov chains in the context of distribution trans- 
formers as we did for CTMCs in this paper. Interestingly, the Skolem Prob- 
lem can be reduced to the model checking problem for the logic PLTL [3]. The 
Skolem Problem asks whether a given linear recurrence sequence has a zero term 
and plays a vital role in the reachability analysis of linear dynamical systems. 
Unfortunately, the decidability of the problem remains open [32]. Recently, the 
Continuous Skolem Problem has been proposed with good behavior (the problem 
is decidable) and forms a fundamental decision problem concerning reachability 
in continuous-time linear dynamical systems [16]. Not surprisingly, the Continu-
ous Skolem Problem can be reduced to model-checking CLL. The primary step 
of verifying CLL formulas is to find a real root isolation of a PEF in a given 
interval. Chonev, Ouaknine and Worrell reformulated the Continuous Skolem 
Problem in terms of whether a PEF has a root in a given interval, which is 
decidable subject to Schanuel's conjecture [16]. An algorithm for finding root 
isolation can also answer the problem of checking the existence of the roots of a 
PEF. However, the reverse does not work in general. Therefore, the decidability 
of the Continuous Skolem Problem cannot be applied to establish that of our 
CLL model checking. 


Remark 1. By adopting the method in this paper, we established the decidability
of model checking quantum CTMCs against signal temporal logic [40]. Again,
we need Schanuel’s conjecture to guarantee the correctness. A quantum CTMC is
governed by Lindblad's master equation and is a more general real-time probabilistic
Markov model than a CTMC, i.e., a CTMC is an instance of quantum CTMCs.
We converted the evolution of Lindblad's master equation into a distribution
transformer that preserves the laws of quantum mechanics. We reduced the
model-checking problem of quantum CTMCs to the real root isolation problem
considered in this paper, and thus our method could be applied to it.


9 Conclusion 


This paper revisited the study of temporal properties of finite-state CTMCs by 
symbolizing the probability value space [0,1] into a finite set of intervals. To 
specify relatively and absolutely temporal properties, we propose a probabilistic 
logic for CTMCs, namely continuous linear-time logic (CLL). We have considered 
the model checking problem in this setting. Our main result is that a state-of-the- 
art real root isolation algorithm over the field of algebraic numbers was proposed 
to establish the decidability of the model checking problem under the condition 
that Schanuel's conjecture holds. 

This paper aims to show decidability in as simple a fashion as possible without
paying much attention to complexity issues. Faster algorithms for our current
constructions would be a significant improvement from a practical standpoint.
Abstract. We consider the problem: is the optimal expected total reward
to reach a goal state in a partially observable Markov decision
process (POMDP) below a given threshold? We tackle this—generally 
undecidable—problem by computing under-approximations on these total 
expected rewards. This is done by abstracting finite unfoldings of the 
infinite belief MDP of the POMDP. The key issue is to find a suitable 
under-approximation of the value function. We provide two techniques: a 
simple (cut-off) technique that uses a good policy on the POMDP, and 
a more advanced technique (belief clipping) that uses minimal shifts of 
probabilities between beliefs. We use mixed-integer linear programming 
(MILP) to find such minimal probability shifts and experimentally show 
that our techniques scale quite well while providing tight lower bounds 
on the expected total reward. 


1 Introduction 


The relevance of POMDPs. Partially observable Markov decision processes
(POMDPs) originated in operations research and nowadays are a pivotal model for
planning in AI [40]. They inherit all features of classical MDPs: each state has a 
set of discrete probability distributions over the states and rewards are earned 
when taking transitions. However, states are not fully observable. Intuitively, 
certain aspects of the states can be identified, such as a state's colour, but states 
themselves cannot be observed. This partial observability reflects, for example, a 
robot's view of its environment while only having the limited perspective of its
sensors at its disposal. The main goal is to obtain a policy—a plan how to resolve 
the non-determinism in the model—for a given objective. The key problem here 
is that POMDP policies must base their decisions only on the observable aspects 
(e.g. colours) of states. This stands in contrast to policies for MDPs which can 
make decisions dependent on the entire history of full state information. 

Analysing POMDPs. Typical POMDP planning problems consider either finite- 
horizon objectives or infinite-horizon objectives under discounting. Finite-horizon 
objectives focus on reaching a certain goal state (such as "the robot has collected 
all items") within a given number of steps. For infinite horizons, no step bound 
is provided and typically rewards along a run are weighted by a discounting
factor that indicates how much immediate rewards are favoured over more distant 
ones. Existing techniques to treat these objectives include variations of value 
iteration [46,36,20,18,52,53] and policy trees [29]. Point-based techniques [38,42] 
approximate a POMDP’s value function using a finite subset of beliefs which is 
iteratively updated. Algorithms include PBVI [38], Perseus [48], SARSOP [30] 
and HSVI [45]. Point-based methods can treat large POMDPs for both finite- 
and discounted infinite-horizon objectives [42]. 


Problem statement. In this paper we consider the problem: is the maximal expected 
total reward to reach a given goal state in a POMDP below a given threshold? 
We thus consider an infinite-horizon objective without discounting—also called 
an indefinite-horizon objective. A specific instance of the considered problem is 
the reachability probability to eventually reach a given goal state in a POMDP. 
This problem is undecidable [33,34] in general. Intuitively, this is due to the fact 
that POMDP policies need to consider the entire (infinite) observation history 
to make optimal decisions. For a POMDP, this notion is captured by an infinite, 
fully observable MDP, its belief MDP. This MDP is obtained from observation 
sequences inducing probabilities of being in certain states of the POMDP. 

Previously proposed methods to solve the problem are e.g. to use approx- 
imate value iteration [22], optimisation and search techniques [1,12], dynamic 
programming [6], Monte Carlo simulation [43], game-based abstraction [51], and 
machine learning [13,14,19]. Other approaches restrict the memory size of the 
policies [35]. The synthesis of (possibly randomised) finite-memory policies is
ETR-complete¹ [28]. Techniques to obtain finite-memory policies use e.g.
parameter synthesis [28] or satisfiability checking and SMT solving [15,50].


Our approach. We tackle the aforementioned problem by computing under- 
approximations on maximal total expected rewards. This is done by considering 
finite unfoldings of the infinite belief MDP of the POMDP, and then applying 
abstraction. The key issue here is to find a suitable under-approximation of 
the POMDP’s value function. We provide two techniques: a simple (cut-off) 
technique that uses a good policy on the POMDP, and a more advanced tech- 
nique (belief clipping) that uses minimal shifts of probabilities between beliefs 
and can be applied on top of the simple approach. We use mixed-integer linear 
programming (MILP) to find such minimal probability shifts. Cut-off techniques 
for indefinite-horizon objectives have been used on computation trees—rather 
than on the belief MDP as used here—in Goal-HSVI [24]. Belief clipping amends 
the probabilities in a belief to be in a state of the POMDP yielding discretised 
values, i.e. an abstraction of the probability range [0, 1] is applied. Such grid-based 
approximations are inspired by Lovejoy’s grid-based belief MDP discretisation 
method [32]. They have also been used in [7] in the context of dynamic pro- 
gramming for POMDPs, and to over-approximate the value function in model 
checking of POMDPs [8]. In fact, this paper on determining lower bounds for 


1 A decision problem is ETR-complete if it can be reduced to a polynomial-length 
sentence in the Existential Theory of the Reals (for which the satisfiability problem is 
decidable) in polynomial time, and there is such a reduction in the reverse direction. 
indefinite-horizon objectives can be seen as the dual counterpart of [8]. Our key
challenge—compared to the approach of [8] —is that the value at a certain belief 
cannot easily be under-approximated with a convex combination of values of 
nearby beliefs. On the other hand, an under-approximation can benefit from a 
"good" guess of some initial POMDP policy. In the context of [8], such a guessed 
policy is of limited use for over-approximating values in the POMDP induced 
by an optimal policy. Although our approach is applicable to all thresholds, the 
focus of our work is on determining under-approximations for quantitative object- 
ives. Dedicated verification techniques for the qualitative setting—almost-sure 
reachability—are presented in [17,16,27]. 


Experimental results. We have implemented our cut-off and belief clipping
approaches on top of the probabilistic model checker STORM [23] and applied them to a
range of benchmarks. We provide a comparison with the model checking
approach in [37], and determine the tightness of our under-approximations by 
comparing them to over-approximations obtained using the algorithm from [8]. 
Our main findings from the experimental validation are: 

— Cut-offs often generate tight bounds while being computationally inexpensive. 
— The clipping approach may further improve the accuracy of the approximation. 
— Our implementation can deal with POMDPs with tens of thousands of states. 
— Mostly, the obtained under-approximations are less than 10% off.


2 Preliminaries and Problem Statement 


Let $\mathrm{Dist}(A) := \{\mu : A \to [0,1] \mid \sum_{a \in A}\mu(a) = 1\}$ denote the set of probability
distributions over a finite set $A$. The set $\mathrm{supp}(\mu) := \{a \in A \mid \mu(a) > 0\}$ is the
support of $\mu \in \mathrm{Dist}(A)$. Let $\mathbb{R}^{\infty} := \mathbb{R} \cup \{\infty, -\infty\}$. We use Iverson bracket
notation, where $[x] = 1$ if the Boolean expression $x$ is true and $[x] = 0$ otherwise.


2.1 Partially Observable MDPs 


Definition 1 (MDP). A Markov decision process (MDP) is a tuple $M =
(S, Act, P, s_{init})$ with a (finite or infinite) set of states $S$, a finite set of actions
$Act$, a transition function $P : S \times Act \times S \to [0,1]$ with $\sum_{s' \in S} P(s, \alpha, s') \in \{0,1\}$
for all $s \in S$ and $\alpha \in Act$, and an initial state $s_{init}$.


We fix an MDP $M := (S, Act, P, s_{init})$. For $s \in S$ and $\alpha \in Act$, let $\mathrm{post}^M(s, \alpha) :=
\{s' \in S \mid P(s, \alpha, s') > 0\}$ denote the set of $\alpha$-successors of $s$ in $M$. The set of
enabled actions in $s \in S$ is given by $Act(s) := \{\alpha \in Act \mid \mathrm{post}^M(s, \alpha) \neq \emptyset\}$.


Definition 2 (POMDP). A partially observable MDP (POMDP) is a tuple
$\mathcal{M} = (M, Z, O)$, where $M$ is the underlying MDP with $|S| \in \mathbb{N}$, i.e. $S$ is finite,
$Z$ is a finite set of observations, and $O : S \to Z$ is an observation function such
that $O(s) = O(s') \implies Act(s) = Act(s')$ for all $s, s' \in S$.


We fix a POMDP $\mathcal{M} := (M, Z, O)$ with underlying MDP $M$. We lift the notion
of enabled actions to observations $z \in Z$ by setting $Act(z) := Act(s)$ for some
$s \in S$ with $O(s) = z$, which is valid since states with the same observations are
required to have the same enabled actions. The notions defined for MDPs below
also straightforwardly apply to POMDPs.
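The definitions above translate directly into a small data structure. The following is a minimal sketch, not the representation used by any particular tool: the class `MDP`, its methods, the helper `is_valid_observation`, and the three-state model are all hypothetical names and data introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    states: set
    actions: set
    P: dict      # maps (s, a, s') to a probability; absent triples mean 0
    s_init: str

    def post(self, s, a):
        """Set of a-successors of s."""
        return {t for (u, b, t), p in self.P.items() if (u, b) == (s, a) and p > 0}

    def enabled(self, s):
        """Act(s): the actions with at least one successor in s."""
        return {a for a in self.actions if self.post(s, a)}

    def sums_ok(self, tol=1e-9):
        """Outgoing probability mass of each (s, a) is 0 (disabled) or 1 (enabled)."""
        for s in self.states:
            for a in self.actions:
                mass = sum(p for (u, b, _), p in self.P.items() if (u, b) == (s, a))
                if min(abs(mass), abs(mass - 1.0)) > tol:
                    return False
        return True

def is_valid_observation(mdp, obs):
    """Check that O(s) = O(s') implies Act(s) = Act(s')."""
    seen = {}
    for s in mdp.states:
        acts = mdp.enabled(s)
        if seen.setdefault(obs[s], acts) != acts:
            return False
    return True

# Hypothetical three-state model with two observations.
mdp = MDP(
    states={"s0", "s1", "s2"},
    actions={"a", "b"},
    P={("s0", "a", "s0"): 0.5, ("s0", "a", "s1"): 0.5, ("s0", "b", "s0"): 1.0,
       ("s1", "a", "s1"): 1.0, ("s1", "b", "s2"): 1.0,
       ("s2", "a", "s2"): 1.0, ("s2", "b", "s2"): 1.0},
    s_init="s0",
)
obs = {"s0": "z0", "s1": "z0", "s2": "z1"}
```

Because `s0` and `s1` share the observation `z0` and both enable `a` and `b`, `is_valid_observation(mdp, obs)` accepts this model; removing the `b`-transition of `s1` would make it reject.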


Remark 1. More general observation functions of the form O : S x Act — Dist(Z) 
can be encoded in this formalism by using a polynomially larger state space [16]. 


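An infinite path through an MDP (and a POMDP) is a sequence $\tilde\pi = s_0\alpha_1 s_1\alpha_2 \ldots$
such that $\alpha_{i+1} \in Act(s_i)$ and $s_{i+1} \in \mathrm{post}^M(s_i, \alpha_{i+1})$ for all $i \in \mathbb{N}$. A finite path is
a finite prefix $\hat\pi = s_0\alpha_1 \ldots \alpha_n s_n$ of an infinite path $\tilde\pi$. For finite $\hat\pi$ let $\mathrm{last}(\hat\pi) := s_n$
and $|\hat\pi| := n$. For infinite $\tilde\pi$ set $|\tilde\pi| := \infty$ and let $\tilde\pi[i]$ denote the finite prefix of
length $i \in \mathbb{N}$. We denote the set of finite and infinite paths in $M$ by $\mathrm{Paths}^M_{fin}$
and $\mathrm{Paths}^M_{inf}$, respectively. Let $\mathrm{Paths}^M := \mathrm{Paths}^M_{fin} \cup \mathrm{Paths}^M_{inf}$. Paths are lifted to
the observation level by observation traces. The observation trace of a (finite or
infinite) path $\pi = s_0\alpha_1 s_1\alpha_2 \ldots \in \mathrm{Paths}^M$ is $O(\pi) := O(s_0)\alpha_1 O(s_1)\alpha_2 \ldots$. Two
paths $\pi, \pi' \in \mathrm{Paths}^M$ are observation-equivalent if $O(\pi) = O(\pi')$.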
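Observation traces and observation-equivalence of finite paths can be sketched as follows. The list encoding of a path (states at even positions, actions at odd positions) and the function names are assumptions made for this illustration.

```python
def observation_trace(path, obs):
    """Map a finite path [s0, a1, s1, a2, ...] to [O(s0), a1, O(s1), a2, ...]."""
    return [obs[x] if i % 2 == 0 else x for i, x in enumerate(path)]

def observation_equivalent(p1, p2, obs):
    """Two paths are observation-equivalent if their traces coincide."""
    return observation_trace(p1, obs) == observation_trace(p2, obs)

# Hypothetical observation function: s0 and s1 look alike, s2 is distinguishable.
obs = {"s0": "z0", "s1": "z0", "s2": "z1"}
p1 = ["s0", "a", "s0", "b", "s0"]
p2 = ["s0", "a", "s1", "b", "s0"]
p3 = ["s0", "a", "s1", "b", "s2"]
```

Here `p1` and `p2` visit different states but yield the same trace, so an observation-based policy must choose the same action distribution after both.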

Policies resolve the non-determinism present in MDPs (and POMDPs). Given 
a finite path 7, a policy determines the action to take at last(7). 


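Definition 3 (Policy). A policy for $M$ is a function $\sigma : \mathrm{Paths}^M_{fin} \to \mathrm{Dist}(Act)$
such that for each path $\hat\pi \in \mathrm{Paths}^M_{fin}$, $\mathrm{supp}(\sigma(\hat\pi)) \subseteq Act(\mathrm{last}(\hat\pi))$.


A policy $\sigma$ is deterministic if $|\mathrm{supp}(\sigma(\hat\pi))| = 1$ for all $\hat\pi \in \mathrm{Paths}^M_{fin}$. Otherwise
it is randomised. $\sigma$ is memoryless if for all $\hat\pi, \hat\pi' \in \mathrm{Paths}^M_{fin}$ we have $\mathrm{last}(\hat\pi) =
\mathrm{last}(\hat\pi') \implies \sigma(\hat\pi) = \sigma(\hat\pi')$. $\sigma$ is observation-based if for all $\hat\pi, \hat\pi' \in \mathrm{Paths}^M_{fin}$ it
holds that $O(\hat\pi) = O(\hat\pi') \implies \sigma(\hat\pi) = \sigma(\hat\pi')$. We denote the set of policies for $M$
by $\Sigma^{M}$ and the set of observation-based policies for $\mathcal{M}$ by $\Sigma^{\mathcal{M}}_{obs}$. A finite-memory
policy (fm-policy) can be represented by a finite automaton where the current
memory state and the state of the MDP determine the actions to take [4].

The probability measure $\mu^{\sigma,s}_{M}$ for paths in $M$ under policy $\sigma$ and initial state
$s$ is the probability measure of the Markov chain induced by $M$, $\sigma$, and $s$ [4].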

We use reward structures to model quantities like time, or energy consumption. 


Definition 4 (Reward Structure). A reward structure for $M$ is a function
$R : S \times Act \times S \to \mathbb{R}$ such that either for all $s, s' \in S$, $\alpha \in Act$, $R(s, \alpha, s') \ge 0$
or for all $s, s' \in S$, $\alpha \in Act$, $R(s, \alpha, s') \le 0$ holds. In the former case, we call $R$
positive, otherwise negative.


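We fix a reward structure $R$ for $M$. The total reward along a path $\pi$ is defined
as $\mathrm{rew}_{M,R}(\pi) := \sum_{i=1}^{|\pi|} R(s_{i-1}, \alpha_i, s_i)$. The total reward is always well-defined,
even if $\pi$ is infinite, since all rewards are assumed to be either non-negative or
non-positive. For an infinite path $\tilde\pi$ we define the total reward until reaching a
set of goal states $G \subseteq S$ by

$$\mathrm{rew}_{M,R,G}(\tilde\pi) := \begin{cases} \mathrm{rew}_{M,R}(\hat\pi) & \text{if } \exists i \in \mathbb{N} : \hat\pi = \tilde\pi[i] \wedge \mathrm{last}(\hat\pi) \in G \wedge \forall j < i : \mathrm{last}(\tilde\pi[j]) \notin G, \\ \mathrm{rew}_{M,R}(\tilde\pi) & \text{otherwise.} \end{cases}$$

Intuitively, $\mathrm{rew}_{M,R,G}(\tilde\pi)$ accumulates reward along $\tilde\pi$ until the first visit of a goal
state $s \in G$. If no goal state is reached, reward is accumulated along the infinite
path. The expected total reward until reaching $G$ for policy $\sigma$ and state $s$ is

$$\mathbb{E}^{\sigma}_{M,R}(s \models \lozenge G) := \int_{\tilde\pi \in \mathrm{Paths}^M_{inf}} \mathrm{rew}_{M,R,G}(\tilde\pi) \cdot \mu^{\sigma,s}_{M}(d\tilde\pi).$$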
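The accumulation of reward until the first goal visit can be sketched on a finite path prefix as follows. This only illustrates the definition of the path reward, not the expectation over all paths; the function name, the list encoding of paths, and the example rewards are assumptions. If the prefix never reaches the goal, the sketch simply returns the reward of the prefix.

```python
def reward_until_goal(path, R, goal):
    """Sum R(s_{i-1}, a_i, s_i) along [s0, a1, s1, ..., an, sn] up to the first goal visit.

    R maps (s, a, s') to a reward; absent triples count as 0.
    """
    total = 0.0
    if path[0] in goal:
        return total                       # goal already reached at the start
    for i in range(1, len(path), 2):       # odd positions hold actions
        s, a, t = path[i - 1], path[i], path[i + 1]
        total += R.get((s, a, t), 0.0)
        if t in goal:
            break                          # stop at the first goal state
    return total

# Hypothetical rewards: 1 for the b-step from s1 to s2, 5 for looping in s2.
R = {("s1", "b", "s2"): 1.0, ("s2", "a", "s2"): 5.0}
path = ["s0", "a", "s0", "a", "s1", "b", "s2", "a", "s2"]
collected = reward_until_goal(path, R, {"s2"})
```

The reward of the loop in `s2` is ignored because it is collected only after the goal state has been visited.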


Observation-based policies capture the notion that a decision procedure for a 
POMDP only accesses the observations and their history and not the entire state 
of the system. We are interested in reasoning about minimal and maximal values 
over all observation-based policies. For our explanations we focus on maximising 
(non-negative or non-positive) expected rewards. Minimisation can be achieved 
by negating all rewards. 


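Definition 5 (Maximal Expected Total Reward). The maximal expected
total reward until reaching $G$ from $s$ in POMDP $\mathcal{M}$ is

$$\mathbb{E}^{\max}_{\mathcal{M},R}(s \models \lozenge G) := \sup_{\sigma \in \Sigma^{\mathcal{M}}_{obs}} \mathbb{E}^{\sigma}_{\mathcal{M},R}(s \models \lozenge G).$$

We define $\mathbb{E}^{\max}_{\mathcal{M},R}(\lozenge G) := \mathbb{E}^{\max}_{\mathcal{M},R}(s_{init} \models \lozenge G)$.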


The central problem of our work, the indefinite-horizon total reward problem, 
asks the question whether the maximal expected total reward until reaching a 
goal exceeds a given threshold. 


Problem 1. Given a POMDP $\mathcal{M}$, reward structure $R$, set of goal states
$G \subseteq S$, and threshold $\lambda \in \mathbb{R}$, decide whether $\mathrm{ER}^{\max}_{\mathcal{M},R}(\Diamond G) \leq \lambda$.


Example 1. Fig. 1 shows a POMDP $\mathcal{M}$ with three states and two observations:
$s_0$ and $s_1$ share one observation, and $s_2$ has the other. A reward of 1 is collected when transitioning
from $s_1$ to $s_2$ via the $\beta$-action. All other rewards are zero.
The policy that always selects $\alpha$ at $s_0$ and $\beta$ at $s_1$
maximises the expected total reward to reach $G = \{s_2\}$
but is not observation-based. The observation-based policy
that selects $\alpha$ for the first $n \in \mathbb{N}$ transition steps and
$\beta$ afterwards yields an expected total reward of
$1 - (1/2)^n$. With $n \to \infty$ we obtain $\mathrm{ER}^{\max}_{\mathcal{M},R}(\Diamond \{s_2\}) = 1$.

Figure 1. POMDP $\mathcal{M}$


As computing maximal expected rewards exactly in POMDPs is undecidable
[34], we aim at under-approximating the actual value $\mathrm{ER}^{\max}_{\mathcal{M},R}(\Diamond G)$. This allows
us to answer our problem negatively if the computed lower bound exceeds $\lambda$.


Remark 2. Expected rewards can be used to describe reachability probabilities 
by assigning reward 1 to all transitions entering G and assigning reward 0 to 
all other transitions. Our approach can thus be used to obtain lower bounds on 
reachability probabilities in POMDPs. This also holds for almost-sure reachability 
(i.e. "is the reachability probability one?"), though dedicated methods like those
presented in [17,16,27] are better suited for that setting. 
2.2 Beliefs


The semantics of a POMDP $\mathcal{M}$ are captured by its (fully observable) belief
MDP. The infinite state space of this MDP consists of beliefs [3,44]. A belief is a
distribution over the states of the POMDP where each component describes the
likelihood to be in a POMDP state given a history of observations. We denote the
set of all beliefs for $\mathcal{M}$ by $B_{\mathcal{M}} := \{b \in \mathit{Dist}(S) \mid \forall s, s' \in \mathrm{supp}(b)\colon O(s) = O(s')\}$
and write $O(b) \in Z$ for the unique observation $O(s)$ of all $s \in \mathrm{supp}(b)$.

The belief MDP of $\mathcal{M}$ is constructed by starting in the belief corresponding
to the initial state and computing successor beliefs to unfold the MDP. Let
$P(s, \alpha, z) := \sum_{s' \in S} [O(s') = z] \cdot P(s, \alpha, s')$ be the probability to observe $z \in Z$
after taking action $\alpha$ in POMDP state $s$. Then, the probability to observe $z$
after taking action $\alpha$ in belief $b$ is $P(b, \alpha, z) := \sum_{s \in S} b(s) \cdot P(s, \alpha, z)$. We refer
to $\llbracket b|\alpha, z \rrbracket \in B_{\mathcal{M}}$, the belief after taking $\alpha$ in $b$ conditioned on observing $z$, as
the $\alpha$-$z$-successor of $b$. If $P(b, \alpha, z) > 0$, it is defined component-wise as

$$\llbracket b|\alpha, z \rrbracket(s) := \frac{[O(s) = z] \cdot \sum_{s' \in S} b(s') \cdot P(s', \alpha, s)}{P(b, \alpha, z)}$$

for all $s \in S$. Otherwise $\llbracket b|\alpha, z \rrbracket$ is undefined.
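The successor-belief computation can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: beliefs are dictionaries over their support, and the transition function `P` and observation function `O` below describe a hypothetical three-state model whose probabilities are assumed for the example.

```python
from fractions import Fraction


def successor_belief(belief, action, obs, P, O):
    """Return (P(b, action, obs), successor belief), or (0, None) if obs cannot be observed."""
    unnormalised = {}
    for s, mass in belief.items():
        for s_next, p in P.get((s, action), {}).items():
            if O[s_next] == obs:  # Iverson bracket [O(s') = z]
                unnormalised[s_next] = unnormalised.get(s_next, 0) + mass * p
    prob_obs = sum(unnormalised.values())
    if prob_obs == 0:
        return 0, None  # the successor is undefined in this case
    return prob_obs, {s: m / prob_obs for s, m in unnormalised.items()}


# Hypothetical model (assumed, not read off Fig. 1): s0 and s1 share observation "x".
half = Fraction(1, 2)
O = {"s0": "x", "s1": "x", "s2": "y"}
P = {("s0", "a"): {"s0": half, "s1": half}, ("s1", "a"): {"s1": Fraction(1)}}

_, b1 = successor_belief({"s0": Fraction(1)}, "a", "x", P, O)
_, b2 = successor_belief(b1, "a", "x", P, O)
print(b1, b2)
```

Exact rationals are used so that beliefs reached along different paths compare equal, which matters when beliefs serve as MDP states.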
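Definition 6 (Belief MDP). The belief MDP of $\mathcal{M}$ is the MDP $\mathrm{bel}(\mathcal{M}) =
(B_{\mathcal{M}}, Act, P^{B}, b_{\mathit{init}})$, where $B_{\mathcal{M}}$ is the set of all beliefs in $\mathcal{M}$, $Act$ is as for $\mathcal{M}$,
$b_{\mathit{init}} := \{s_{\mathit{init}} \mapsto 1\}$ is the initial belief, and $P^{B}\colon B_{\mathcal{M}} \times Act \times B_{\mathcal{M}} \to [0,1]$ is the
belief transition function with

$$P^{B}(b, \alpha, b') := \begin{cases} P(b, \alpha, z) & \text{if } b' = \llbracket b|\alpha, z \rrbracket, \\ 0 & \text{otherwise.} \end{cases}$$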


We lift a POMDP reward structure $R$ to the belief MDP [25].


Definition 7 (Belief Reward Structure). For beliefs $b, b' \in B_{\mathcal{M}}$ and action
$\alpha \in Act$, the belief reward structure $R^{B}$ based on $R$ associated with $\mathrm{bel}(\mathcal{M})$ is
given by

$$R^{B}(b, \alpha, b') := \frac{\sum_{s \in S} b(s) \cdot \sum_{s' \in S} [O(s') = O(b')] \cdot R(s, \alpha, s') \cdot P(s, \alpha, s')}{P(b, \alpha, O(b'))}.$$

Given a set of goal states $G \subseteq S$, we assume, for simplicity, that there is a set
of observations $Z' \subseteq Z$ such that $s \in G$ iff $O(s) \in Z'$. This assumption can always
be ensured by transforming the POMDP $\mathcal{M}$. See the full technical report [10] for
details. The set of goal beliefs for $G$ is given by $G_{B} := \{b \in B_{\mathcal{M}} \mid \mathrm{supp}(b) \subseteq G\}$.

We now lift the computation of expected rewards to the belief level. Based on
the well-known Bellman equations [5], the belief MDP induces a function that
maps every belief to the expected total reward accumulated from that belief.

Definition 8 (POMDP Value Function). For $b \in B_{\mathcal{M}}$, the $n$-step value
function $V_n\colon B_{\mathcal{M}} \to \mathbb{R}$ of $\mathcal{M}$ is defined recursively as $V_0(b) := 0$ and

$$V_n(b) := [b \notin G_{B}] \cdot \max_{\alpha \in Act} \sum_{b' \in \mathrm{post}^{\mathrm{bel}(\mathcal{M})}(b, \alpha)} P^{B}(b, \alpha, b') \cdot \left(R^{B}(b, \alpha, b') + V_{n-1}(b')\right).$$
Figure 2. Belief MDP $\mathrm{bel}(\mathcal{M})$ of POMDP $\mathcal{M}$ from Fig. 1


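The (optimal) value function $V^{*}\colon B_{\mathcal{M}} \to \mathbb{R}^{\infty}$ is given by $V^{*}(b) := \lim_{n \to \infty} V_n(b)$.

The $n$-step value function is piecewise linear and convex [44]. Thus, the optimal
value function can be approximated arbitrarily closely by a piecewise linear convex
function [47]. The value function yields expected total rewards in $\mathcal{M}$ and $\mathrm{bel}(\mathcal{M})$:

$$\mathrm{ER}^{\max}_{\mathcal{M},R}(s \models \Diamond G) = \mathrm{ER}^{\max}_{\mathrm{bel}(\mathcal{M}),R^{B}}(\{s \mapsto 1\} \models \Diamond G_{B}) = V^{*}(\{s \mapsto 1\}).$$

Example 2. Fig. 2 shows a fragment of the belief MDP of the POMDP from
Fig. 1. Observe $\mathrm{ER}^{\max}_{\mathrm{bel}(\mathcal{M}),R^{B}}(\Diamond \{\{s_2 \mapsto 1\}\}) = 1$.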
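The recursion for the $n$-step value function can be sketched on an abstract belief MDP. The interface (`is_goal`, `successors`) and the toy belief chain below are assumptions for illustration: belief `k` stands for the belief after `k` steps of one action, and the reward $1 - 2^{-k}$ of the other action is chosen to be consistent with the value $1 - (1/2)^n$ reported in Example 1, since the actual structure of Fig. 2 is not recoverable from the text.

```python
from fractions import Fraction
from functools import lru_cache


def make_n_step_value(is_goal, successors):
    """Build V_n for a belief MDP given as successors(b) -> {action: [(prob, reward, b_next)]}."""

    @lru_cache(maxsize=None)
    def value(belief, n):
        if n == 0 or is_goal(belief):
            return 0  # V_0 = 0, and goal beliefs contribute no further reward
        return max(
            sum(p * (r + value(b_next, n - 1)) for p, r, b_next in branches)
            for branches in successors(belief).values()
        )

    return value


def is_goal(belief):
    return belief == "goal"


def successors(k):
    # Hypothetical belief chain: "alpha" moves to belief k+1, "beta" stops with reward 1 - 2^-k.
    return {
        "alpha": [(1, 0, k + 1)],
        "beta": [(1, 1 - Fraction(1, 2 ** k), "goal")],
    }


value = make_n_step_value(is_goal, successors)
print([value(0, n) for n in range(1, 6)])  # increasing sequence approaching 1
```

On this toy chain the values grow monotonically with $n$ and approach 1, illustrating how $V^{*}$ arises as the limit of the $V_n$.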


We reformulate our problem statement to focus on the belief MDP. 


Problem 2 (equivalent to Problem 1). For a POMDP $\mathcal{M}$, reward structure $R$,
goal states $G \subseteq S$, and threshold $\lambda \in \mathbb{R}$, decide whether $V^{*}(\{s_{\mathit{init}} \mapsto 1\}) \leq \lambda$.


As the belief MDP is fully observable, standard results for MDPs apply. However,
an exhaustive analysis of $\mathrm{bel}(\mathcal{M})$ is intractable since the belief MDP is, in
general, infinitely large².


3 Finite Exploration Under-Approximation 


Instead of approximating values directly on the POMDP, we consider approx- 
imations of the corresponding belief MDP. The basic idea is to construct a 
finite abstraction of the belief MDP by unfolding parts of it and approximate 
values at beliefs where we decide not to explore. In the resulting finite MDP, 
under-approximative expected reward values can be computed by standard model 
checking techniques. We present two approaches for abstraction: belief cut-offs 
and belief clipping. We incorporate those techniques into an algorithmic framework 
that yields arbitrarily tight under-approximations. 
The technical report [10] contains formal proofs of our claims. 


2 The set of all beliefs, i.e. the state space of $\mathrm{bel}(\mathcal{M})$, is uncountable. The reachable
fragment is countable, though, since each belief has at most $|Z|$ many successors.
Figure 3. Applying belief cut-offs to the belief MDP from Fig. 2


3.1 Belief Cut-Offs 


The general idea of belief cut-offs is to stop exploring the belief MDP at certain 
beliefs—the cut-off beliefs—and assume that a goal state is immediately reached 
while sub-optimal reward is collected. Similar techniques have been discussed in 
the context of fully observable MDPs and other model types [11,26,49,2]. Our 
work adapts the idea of cut-offs for POMDP over-approximations described in [8]
to under-approximations. The main idea of belief cut-offs shares similarities with 
the SARSOP [30] and Goal-HSVI [24] approaches. While they apply cut-offs on 
the level of the computation tree, our approach directly manipulates the belief 
MDP to yield a finite model. 

Let $V\colon B_{\mathcal{M}} \to \mathbb{R}^{\infty}$ with $V(b) \leq V^{*}(b)$ for all $b \in B_{\mathcal{M}}$. We call $V$ an
under-approximative value function and $V(b)$ the cut-off value of $b$. In each of the cut-off
beliefs $b$, instead of adding the regular transitions to its successors, we add a
transition with probability 1 to a dedicated goal state $b_{\mathit{cut}}$. In the modified reward
structure $R'$, this cut-off transition is assigned a reward³ of $V(b)$, causing the
value for a cut-off belief $b$ in the modified MDP to coincide with $V(b)$. Hence,
the exact value of the cut-off belief, and thus the value of all other explored
beliefs, is under-approximated.


Example 3. Fig. 3 shows the resulting finite MDP obtained when considering
the belief MDP from Fig. 2 with single cut-off belief $b = \{s_0 \mapsto 1/4, s_1 \mapsto 3/4\}$.


Computing cut-off values. The question of finding a suitable under-approximative
value function $V$ is central to the cut-off approach. For an effective approximation,
such a function should be easy to compute while still providing values close
to the optimum. If we assume a positive reward structure, the constant value
0 is always a valid under-approximation. A more sophisticated approach is to
compute suboptimal expected reward values for the states of the POMDP using
some arbitrary, fixed observation-based policy $\sigma \in \Sigma^{\mathcal{M}}_{\mathrm{obs}}$. Let $U^{\sigma}\colon S \to \mathbb{R}^{\infty}$
such that for all $s \in S$, $U^{\sigma}(s) = \mathrm{ER}^{\sigma}_{\mathcal{M},R}(s \models \Diamond G)$. Then, we define the function
$\mathcal{U}^{\sigma}\colon B_{\mathcal{M}} \to \mathbb{R}^{\infty}$ as $\mathcal{U}^{\sigma}(b) := \sum_{s \in \mathrm{supp}(b)} b(s) \cdot U^{\sigma}(s)$.
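A minimal sketch of this cut-off value computation, assuming a memoryless observation-based policy and non-negative rewards. The state values are obtained by iterating the fixed-point equation from zero, so with non-negative rewards every intermediate result is itself a lower bound. The model, policy, and function names are hypothetical and only serve the example.

```python
def evaluate_policy(states, goal, policy, O, P, R, sweeps=1000):
    """Approximate U(s), the expected total reward until reaching `goal`, for a
    memoryless policy mapping observations to actions (iteration from U = 0)."""
    U = {s: 0.0 for s in states}
    for _ in range(sweeps):
        U = {
            s: 0.0 if s in goal else sum(
                p * (R.get((s, policy[O[s]], t), 0.0) + U[t])
                for t, p in P[(s, policy[O[s]])].items()
            )
            for s in states
        }
    return U


def cutoff_value(belief, U):
    """Weighted sum of state values over the support of the belief."""
    return sum(mass * U[s] for s, mass in belief.items())


# Hypothetical model: action "b" moves to the goal s2 and pays 1 only from s1.
states = ["s0", "s1", "s2"]
O = {"s0": "x", "s1": "x", "s2": "y"}
P = {("s0", "b"): {"s2": 1.0}, ("s1", "b"): {"s2": 1.0}}
R = {("s1", "b", "s2"): 1.0}
policy = {"x": "b", "y": "b"}  # one action per observation

U = evaluate_policy(states, {"s2"}, policy, O, P, R)
print(cutoff_value({"s0": 0.25, "s1": 0.75}, U))
```

The policy only sees observations, so it must pick the same action in s0 and s1. This is what makes the resulting value a valid under-approximation for the POMDP.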


3 We slightly deviate from Def. 4 by allowing transition rewards to be $-\infty$ or $+\infty$.
Alternatively, we could introduce new sink states with a non-zero self-loop reward.
Lemma 1. $\mathcal{U}^{\sigma}$ is an under-approximative value function, i.e. for all $b \in B_{\mathcal{M}}$:

$$\mathcal{U}^{\sigma}(b) := \sum_{s \in \mathrm{supp}(b)} b(s) \cdot U^{\sigma}(s) \leq V^{*}(b).$$

Thus, finding a suitable under-approximative value function reduces to finding
"good" policies for $\mathcal{M}$, e.g. by using randomly guessed fm-policies, machine
learning methods [13], or a transformation to a parametric model [28].


3.2 Belief Clipping 


The cut-off approach provides a universal way to construct an MDP which under- 
approximates the expected total reward value for a given POMDP. The quality 
of the approximation, however, is highly dependent on the under-approximative 
value function used. Furthermore, regions where the belief MDP slowly converges 
towards a belief may pose problems in practice. 

As a potential remedy for these problems, we propose a different concept
called belief clipping. Intuitively, the procedure shifts some of the probability mass
of a belief $b$ in order to transform $b$ to another belief $\tilde{b}$. We then connect $b$ to $\tilde{b}$ in
a way that the accuracy of our approximation of the value $V^{*}(b)$ depends only
on the approximation of $V^{*}(\tilde{b})$ and the so-called clipping value, some notion of
distance between $b$ and $\tilde{b}$ that we discuss below. We can thus focus on exploring
the successors of $\tilde{b}$ to obtain good approximations for both beliefs $b$ and $\tilde{b}$.


Definition 9 (Belief Clip). For $b \in B_{\mathcal{M}}$, we call $\mu\colon \mathrm{supp}(b) \to [0,1]$ a belief
clip if $\forall s \in \mathrm{supp}(b)\colon \mu(s) \leq b(s)$ and $\mathcal{S}(\mu) := \sum_{s \in \mathrm{supp}(b)} \mu(s) < 1$. The belief
$(b \ominus \mu) \in B_{\mathcal{M}}$ induced by $\mu$ is defined by

$$\forall s \in \mathrm{supp}(b)\colon\ (b \ominus \mu)(s) := \frac{b(s) - \mu(s)}{1 - \mathcal{S}(\mu)}.$$

Intuitively, a belief clip $\mu$ for $b$ describes for each $s \in \mathrm{supp}(b)$ the probability
mass that is removed ("clipped away") from $b(s)$. The induced belief is obtained
when normalising the resulting values so that they sum up to one.


Example 4. For belief $b = \{s_0 \mapsto 1/4, s_1 \mapsto 3/4\}$, consider the two belief clips
$\mu_1 = \{s_0 \mapsto 1/4, s_1 \mapsto 1/4\}$ and $\mu_2 = \{s_0 \mapsto 1/4, s_1 \mapsto 0\}$. Both induce the same
belief: $(b \ominus \mu_1) = (b \ominus \mu_2) = \{s_0 \mapsto 0, s_1 \mapsto 1\}$.
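The numbers of Example 4 can be reproduced with a short Python sketch of Definition 9. The dictionary encoding of beliefs and clips and the function names are assumptions made for this illustration.

```python
from fractions import Fraction


def is_belief_clip(belief, clip):
    """Check the two conditions of a belief clip: clip(s) <= belief(s) and total mass < 1."""
    return (
        set(clip) == set(belief)
        and all(0 <= clip[s] <= belief[s] for s in belief)
        and sum(clip.values()) < 1
    )


def induced_belief(belief, clip):
    """Remove the clipped mass and renormalise the remainder."""
    if not is_belief_clip(belief, clip):
        raise ValueError("not a belief clip for this belief")
    remaining = 1 - sum(clip.values())
    return {s: (belief[s] - clip[s]) / remaining for s in belief}


q = Fraction(1, 4)
b = {"s0": q, "s1": 3 * q}
mu1 = {"s0": q, "s1": q}
mu2 = {"s0": q, "s1": Fraction(0)}

print(induced_belief(b, mu1))
print(induced_belief(b, mu2))
```

Both clips remove all mass from s0. The clip mu1 additionally removes mass from s1, which is wasteful: it induces the same belief with a larger total clipped mass.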


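We have $\mathrm{supp}(b \ominus \mu) \subseteq \mathrm{supp}(b)$, which also implies $O(b \ominus \mu) = O(b)$. Given
some candidate belief $\tilde{b}$, consider the set of inducing belief clips:

$$\mathcal{C}(b, \tilde{b}) := \{\mu\colon \mathrm{supp}(b) \to [0,1] \mid \mu \text{ is a belief clip for } b \text{ with } \tilde{b} = (b \ominus \mu)\}.$$

Belief $\tilde{b}$ is called an adequate clipping candidate for $b$ iff $\mathcal{C}(b, \tilde{b}) \neq \emptyset$.


Definition 10 (Clipping Value). For $b \in B_{\mathcal{M}}$ and adequate clipping candidate
$\tilde{b}$, the clipping value is $\Delta_{b \to \tilde{b}} := \mathcal{S}(\delta_{b \to \tilde{b}})$, where $\delta_{b \to \tilde{b}} := \arg\min_{\mu \in \mathcal{C}(b, \tilde{b})} \mathcal{S}(\mu)$
and $\mathcal{S}(\mu) = \sum_{s \in \mathrm{supp}(b)} \mu(s)$.
The values $\delta_{b \to \tilde{b}}(s)$ for $s \in \mathrm{supp}(b)$ are the state clipping values.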




Figure 4. Applying belief clipping to the belief MDP from Fig. 2 


Given a belief $b$ and an adequate clipping candidate $\tilde{b}$, we outline how the notion
of belief clipping is used to obtain valid under-approximations. We assume $b \neq \tilde{b}$,
implying $0 < \Delta_{b \to \tilde{b}} < 1$. Instead of exploring all successors of $b$ in $\mathrm{bel}(\mathcal{M})$, the
approach is to add a transition from $b$ to $\tilde{b}$. The newly added transition has
probability $1 - \Delta_{b \to \tilde{b}}$ and gets assigned a reward of 0. The remaining probability
mass (i.e. $\Delta_{b \to \tilde{b}}$) leads to a designated goal state $b_{\mathit{cut}}$. To guarantee that, in
general, the clipping procedure yields a valid under-approximation, we need to
add a corrective reward value to the transition from $b$ to $b_{\mathit{cut}}$. Let $L\colon S \to \mathbb{R}^{\infty}$
be the function which maps each POMDP state to its minimum expected reward in the underlying,
fully observable MDP $M$ of $\mathcal{M}$⁴, i.e. $L(s) = \mathrm{ER}^{\min}_{M,R}(s \models \Diamond G)$. This function
soundly under-approximates the state values which can be achieved by any
observation-based policy. It can be generated using standard MDP analysis.
Given state clipping values $\delta_{b \to \tilde{b}}(s)$ for $s \in \mathrm{supp}(b)$, the reward for the transition
from $b$ to $b_{\mathit{cut}}$ is $\sum_{s \in \mathrm{supp}(b)} (\delta_{b \to \tilde{b}}(s) / \Delta_{b \to \tilde{b}}) \cdot L(s)$.


Example 5. For the belief MDP from Fig. 2, belief $b = \{s_0 \mapsto 1/4, s_1 \mapsto 3/4\}$,
and clipping candidate $\tilde{b} = \{s_0 \mapsto 0, s_1 \mapsto 1\}$ we get $\Delta_{b \to \tilde{b}} = 1/4$, as $\delta_{b \to \tilde{b}} =
\mu_2 = \{s_0 \mapsto 1/4, s_1 \mapsto 0\}$ with the belief clip $\mu_2$ as in Example 4. Furthermore,
$L(s_0) = 0$. The resulting MDP following our construction above is given in Fig. 4.
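For a single candidate, the minimal clip can be computed directly. The sketch below relies on a closed form that is not stated in the text but follows from Definition 9: a clip inducing the candidate $\tilde{b}$ must satisfy $\mu(s) = b(s) - t \cdot \tilde{b}(s)$ with $t = 1 - \mathcal{S}(\mu)$, so the largest admissible $t$ is the minimum of $b(s)/\tilde{b}(s)$ over the support of $\tilde{b}$. Function names are hypothetical, and the value $L(s_1)$ is made up (it is irrelevant here because the clip removes no mass from $s_1$).

```python
from fractions import Fraction


def clipping_values(belief, candidate):
    """Return (Delta, delta) for the minimal clip, or None if the candidate is not adequate."""
    if any(s not in belief for s, m in candidate.items() if m > 0):
        return None  # the candidate has mass outside supp(belief)
    keep = min(belief[s] / m for s, m in candidate.items() if m > 0)
    delta = {s: belief[s] - keep * candidate.get(s, 0) for s in belief}
    return 1 - keep, delta


def corrective_reward(delta, Delta, L):
    """Reward on the transition to b_cut; assumes Delta > 0, i.e. candidate != belief."""
    return sum(delta[s] / Delta * L[s] for s in delta)


q = Fraction(1, 4)
b = {"s0": q, "s1": 3 * q}
candidate = {"s0": Fraction(0), "s1": Fraction(1)}
L = {"s0": 0, "s1": 1}  # L(s0) = 0 as in Example 5; L(s1) is a made-up value

Delta, delta = clipping_values(b, candidate)
print(Delta, delta, corrective_reward(delta, Delta, L))
```

This reproduces the clipping value 1/4 and the state clipping values of Example 5. Enumerating all candidates with such a function corresponds to the naive approach that the MILP formulation is meant to replace.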


The following lemma shows that the construction yields an under-approximation.


Lemma 2. $(1 - \Delta_{b \to \tilde{b}}) \cdot V^{*}(\tilde{b}) + \Delta_{b \to \tilde{b}} \cdot \sum_{s \in \mathrm{supp}(b)} \frac{\delta_{b \to \tilde{b}}(s)}{\Delta_{b \to \tilde{b}}} \cdot L(s) \leq V^{*}(b)$.


Proof (sketch). To gain some intuition, consider the special case where $\Delta_{b \to \tilde{b}} =
\delta_{b \to \tilde{b}}(s) = b(s)$ for some $s \in \mathrm{supp}(b)$. The clipping candidate $\tilde{b}$ can be interpreted
as the conditional probability distribution arising from distribution $b$ given that
$s$ is not the current state. The value $V^{*}(b)$ can be split into the sum of (i) the
probability that $s$ is not the current state times the reward accumulated from
belief $\tilde{b}$ and (ii) the probability that $s$ is the current state times the reward
accumulated from $s$, i.e. from the belief $\{s \mapsto 1\}$. However, for the two summands
we must consider a policy that does not distinguish between the beliefs $b$, $\tilde{b}$, and
$\{s \mapsto 1\}$ as well as their observation-equivalent successors. In other words, the
same sequence of actions must be executed when the same observations are made.

We consider such a policy that in addition is optimal at $\tilde{b}$, i.e. the reward
accumulated from $\tilde{b}$ is equal to $V^{*}(\tilde{b})$. For the reward accumulated from $\{s \mapsto 1\}$,
$L(s)$ provides a lower bound. Hence, $(1 - b(s)) \cdot V^{*}(\tilde{b}) + b(s) \cdot L(s)$ is a lower
bound for the reward accumulated from $b$. A formal proof is given in [10].


4 When rewards are negative, we might have $L(s) = -\infty$ for many $s \in S \setminus G$, in which
case the applicability of the clipping approach is very limited.


To find a suitable clipping candidate for a given belief $b$, we consider a finite
candidate set $\mathcal{B} \subseteq B_{\mathcal{M}}$ consisting of beliefs with observation $O(b)$. These beliefs
do not need to be reachable in the belief MDP. The set can be constructed, e.g.
by taking already explored beliefs or by using a fixed, discretised set of beliefs.

We are interested in minimising the clipping value $\Delta_{b \to \tilde{b}}$ over all candidate
beliefs $\tilde{b} \in \mathcal{B}$. A naive approach is to explicitly compute all clipping values for all
candidates. We use mixed-integer linear programming (MILP) [41] instead.
An MILP is a system of linear inequalities (constraints) and a linear objective
function considering real-valued and integer variables. A feasible solution of the
MILP is a variable assignment that satisfies all constraints. An optimal solution
is a feasible solution that minimises the objective function.


Definition 11 (Belief Clipping MILP). The belief clipping MILP for belief
$b \in B_{\mathcal{M}}$ and finite set of candidates $\mathcal{B} \subseteq \{b' \in B_{\mathcal{M}} \mid O(b') = O(b)\}$ is given by:

minimise $\Delta$ such that:

$$\begin{aligned}
& \sum_{b' \in \mathcal{B}} a_{b'} = 1 && \text{(select exactly one candidate } b'\text{)} && (1) \\
& \forall b' \in \mathcal{B}\colon\ a_{b'} \in \{0, 1\} && && (2) \\
& \sum_{s \in \mathrm{supp}(b)} \delta_s = \Delta && \text{(compute clipping value for selected } b'\text{)} && (3) \\
& \forall s \in \mathrm{supp}(b)\colon\ \delta_s \in [0, b(s)] && && (4) \\
& \forall s \in \mathrm{supp}(b),\ \forall b' \in \mathcal{B}\colon\ \delta_s \geq b(s) - (1 - \Delta) \cdot b'(s) - (1 - a_{b'}) && && (5)
\end{aligned}$$

The MILP consists of $O(|\mathrm{supp}(b)| + |\mathcal{B}|)$ variables and $O(|\mathrm{supp}(b)| \cdot |\mathcal{B}|)$ constraints.
For $b' \in \mathcal{B}$, the binary variable $a_{b'}$ indicates whether $b'$ has been chosen
as the clipping candidate. Moreover, we have variables $\delta_s$ for $s \in \mathrm{supp}(b)$ and a
variable $\Delta$ to represent the (state) clipping values for $b$ and the chosen candidate
$b'$. Constraints 1 and 2 enforce that exactly one of the $a_{b'}$ variables is one, i.e.
exactly one belief is chosen. Constraint 3 forces $\Delta$ to be the sum of all state
clipping values. The $\delta_s$ variables get a value between zero and $b(s)$ (Constraint 4).
Constraint 5 only affects $\delta_s$ if the corresponding belief is chosen. Otherwise, $a_{b'}$
is set to 0 and the value on the right-hand side becomes negative. If a belief
$b'$ is chosen, the minimisation forces Constraint 5 to hold with equality as the
right-hand side is greater than or equal to 0. Assuming $\Delta$ is set to a value below 1, we
obtain valid clipping values as

$$\forall s \in \mathrm{supp}(b)\colon\ \delta_s = b(s) - (1 - \Delta) \cdot b'(s) \iff b'(s) = \frac{b(s) - \delta_s}{1 - \Delta}.$$
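To make the roles of Constraints 1 to 5 concrete, the following sketch checks whether a given variable assignment is feasible for the belief clipping MILP. It is a plain feasibility checker written for illustration, not a solver and not the GUROBI-based implementation mentioned later. The encoding of candidates as a list and of the indicator variables as a parallel list is an assumption.

```python
from fractions import Fraction


def satisfies_milp(belief, candidates, a, delta, Delta):
    """Check Constraints (1)-(5) for indicator values `a` (one per candidate),
    state clipping values `delta`, and clipping value `Delta`."""
    c1 = sum(a) == 1
    c2 = all(x in (0, 1) for x in a)
    c3 = sum(delta.values()) == Delta
    c4 = all(0 <= delta[s] <= belief[s] for s in belief)
    c5 = all(
        delta[s] >= belief[s] - (1 - Delta) * cand.get(s, 0) - (1 - a_c)
        for s in belief
        for cand, a_c in zip(candidates, a)
    )
    return c1 and c2 and c3 and c4 and c5


q = Fraction(1, 4)
b = {"s0": q, "s1": 3 * q}
candidates = [{"s1": Fraction(1)}, {"s0": Fraction(1)}]

# Selecting the first candidate with Delta = 1/4 is feasible (cf. Example 5).
print(satisfies_milp(b, candidates, [1, 0], {"s0": q, "s1": Fraction(0)}, q))
# Clipping nothing is infeasible: Constraint 5 is violated for s0.
print(satisfies_milp(b, candidates, [1, 0], {"s0": Fraction(0), "s1": Fraction(0)}, 0))
```

The assignment with all of $b$ clipped away and $\Delta = 1$ is also accepted by the checker, which is why an optimal value of 1 has to be interpreted separately.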
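Input:  POMDP $\mathcal{M} = (M, Z, O)$ with $M = (S, Act, P, s_{\mathit{init}})$, reward
        structure $R$, goal states $G \subseteq S$, under-approx. value function $V$,
        function $L\colon S \to \mathbb{R}^{\infty}$ with $L(s) = \mathrm{ER}^{\min}_{M,R}(s \models \Diamond G)$
Output: Clipping belief MDP $\mathcal{K}_{\mathcal{M}}$ and reward structure $R^{\mathcal{K}}$

 1  $S^{\mathcal{K}} \leftarrow \{b_{\mathit{init}}, b_{\mathit{cut}}\}$ with $b_{\mathit{init}} = \{s_{\mathit{init}} \mapsto 1\}$ and a new belief state $b_{\mathit{cut}}$
 2  $P^{\mathcal{K}}(b_{\mathit{cut}}, \mathit{cut}, b_{\mathit{cut}}) \leftarrow 1$, $R^{\mathcal{K}}(b_{\mathit{cut}}, \mathit{cut}, b_{\mathit{cut}}) \leftarrow 0$    // add self-loop
 3  $Q \leftarrow \{b_{\mathit{init}}\}$    // initialize exploration set
 4  while $Q \neq \emptyset$ do
 5      $b \leftarrow$ chooseBelief($Q$), $Q \leftarrow Q \setminus \{b\}$    // pop next belief to explore from Q
 6      if $\mathrm{supp}(b) \subseteq G$ then $P^{\mathcal{K}}(b, \mathit{goal}, b) \leftarrow 1$, $R^{\mathcal{K}}(b, \mathit{goal}, b) \leftarrow 0$    // add self-loop
 7      else if exploreBelief($b$) then    // expand b
 8          foreach $\alpha \in Act(b)$ do    // using $\mathrm{bel}(\mathcal{M})$ and $R^{B}$ as in Defs. 6 and 7
 9              foreach $b' \in \mathrm{post}^{\mathrm{bel}(\mathcal{M})}(b, \alpha)$ do
10                  $P^{\mathcal{K}}(b, \alpha, b') \leftarrow P^{B}(b, \alpha, b')$, $R^{\mathcal{K}}(b, \alpha, b') \leftarrow R^{B}(b, \alpha, b')$
11                  if $b' \notin S^{\mathcal{K}}$ then $S^{\mathcal{K}} \leftarrow S^{\mathcal{K}} \cup \{b'\}$, $Q \leftarrow Q \cup \{b'\}$
12      else    // apply cut-off and clipping to b
13          $P^{\mathcal{K}}(b, \mathit{cut}, b_{\mathit{cut}}) \leftarrow 1$, $R^{\mathcal{K}}(b, \mathit{cut}, b_{\mathit{cut}}) \leftarrow V(b)$    // add cut-off transition
14          choose a finite set $\mathcal{B} \subseteq B_{\mathcal{M}}$ of clipping candidates for $b$
15          $\tilde{b}, \Delta_{b \to \tilde{b}}, \{\delta_{b \to \tilde{b}}(s)\}_{s \in \mathrm{supp}(b)} \leftarrow$ solveClippingMILP($b$, $\mathcal{B}$)
16          if $\tilde{b} \neq b$ and $\tilde{b}$ is adequate then    // clip b using $\tilde{b}$
17              $P^{\mathcal{K}}(b, \mathit{clip}, \tilde{b}) \leftarrow (1 - \Delta_{b \to \tilde{b}})$, $P^{\mathcal{K}}(b, \mathit{clip}, b_{\mathit{cut}}) \leftarrow \Delta_{b \to \tilde{b}}$
18              $R^{\mathcal{K}}(b, \mathit{clip}, \tilde{b}) \leftarrow 0$, $R^{\mathcal{K}}(b, \mathit{clip}, b_{\mathit{cut}}) \leftarrow \sum_{s \in \mathrm{supp}(b)} (\delta_{b \to \tilde{b}}(s) / \Delta_{b \to \tilde{b}}) \cdot L(s)$
19              if $\tilde{b} \notin S^{\mathcal{K}}$ then $S^{\mathcal{K}} \leftarrow S^{\mathcal{K}} \cup \{\tilde{b}\}$, $Q \leftarrow Q \cup \{\tilde{b}\}$
20  return $\mathcal{K}_{\mathcal{M}} = (S^{\mathcal{K}}, Act \uplus \{\mathit{goal}, \mathit{cut}, \mathit{clip}\}, P^{\mathcal{K}}, b_{\mathit{init}})$ and $R^{\mathcal{K}}$

Algorithm 1: Belief exploration algorithm with cut-offs and clipping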


A trivial solution of the MILP is always obtained by setting $a_{b'}$ and $\Delta$ to 1 and
$\delta_s$ to $b(s)$ for all $s$ and an arbitrary $b' \in \mathcal{B}$. This corresponds to an invalid belief
clip. However, as we minimise the value for $\Delta$, we can conclude that no belief in
the candidate set is adequate for clipping if $\Delta$ is 1 in an optimal solution.


Theorem 1. An optimal solution to the belief clipping MILP for belief $b$ and
candidate set $\mathcal{B}$ sets $a_{\tilde{b}}$ to 1 and $\Delta$ to a value below 1 iff $\tilde{b} \in \mathcal{B}$ is an adequate
clipping candidate for $b$ with minimal clipping value.


3.3 Algorithm 


We incorporate belief cut-offs and belief clipping into an algorithmic framework 
outlined in Algorithm 1. As input, the algorithm takes an instance of Problems 1 
and 2, i.e. a POMDP M with reward structure R and goal states G. In addition, 
the algorithm considers an under-approximative value function V (Sect. 3.1) and 
a function $L$ for the computation of corrective reward values (Sect. 3.2).

Lines 1 and 2 initialise the state set $S^{\mathcal{K}}$ of the under-approximative MDP $\mathcal{K}_{\mathcal{M}}$
with the initial belief $b_{\mathit{init}}$ and the designated goal state $b_{\mathit{cut}}$ which has only one
transition to itself with reward 0. Furthermore, we initialise the exploration set
$Q$ by adding $b_{\mathit{init}}$ (Line 3). During the computation, $Q$ is used to keep track of
all beliefs we still need to process. We then execute the exploration loop (Lines 4
to 19) until $Q$ becomes empty. In each exploration step, a belief $b$ is selected⁵
and removed from $Q$. There are three cases for the currently processed belief $b$.

If $\mathrm{supp}(b) \subseteq G$, i.e. $b$ is a goal belief, we add a self-loop with reward 0 to $b$
and continue with the next belief (Line 6). The belief $b$ is not expanded as successors of
goal beliefs will not influence the result of the computation.

If $b$ is not a goal belief, we use a heuristic function⁶ exploreBelief to decide
if $b$ is expanded in Line 7. Lines 8 to 11 outline the expansion step. The transitions
from $b$ to its successor beliefs and the corresponding rewards as in the original
belief MDP (see Sect. 2.2) are added. Furthermore, the successor beliefs that
have not been encountered before are added to the set of states $S^{\mathcal{K}}$ and the
exploration set $Q$.

If $b$ is not expanded, we apply the cut-off approach and the clipping approach
to $b$ in Lines 12 to 19. In Line 13 we add a cut-off transition from $b$ to $b_{\mathit{cut}}$ with
a new action cut. We use the given under-approximative value function $V$ to
compute the cut-off reward. Towards the clipping approach, a set of candidate
beliefs is chosen and the belief clipping MILP for $b$ and the candidate set is
constructed as described in Def. 11 (Lines 14 and 15). If an adequate candidate $\tilde{b}$
with clipping values $\Delta_{b \to \tilde{b}}$ and $\delta_{b \to \tilde{b}}(s)$ for $s \in \mathrm{supp}(b)$ has been found, we add
the transitions from $b$ to $b_{\mathit{cut}}$ and to $\tilde{b}$ using a new action clip and probabilities
$\Delta_{b \to \tilde{b}}$ and $1 - \Delta_{b \to \tilde{b}}$, respectively. Furthermore, we equip the transitions with
reward values as described in Sect. 3.2 using the given function $L$ (Lines 16 to 18).
If the clipping candidate $\tilde{b}$ has not been encountered before, we add it to the
state space of the MDP and to the exploration set in Line 19.

The result of the algorithm is an MDP $\mathcal{K}_{\mathcal{M}}$ with reward structure $R^{\mathcal{K}}$. The
set of states $S^{\mathcal{K}}$ of $\mathcal{K}_{\mathcal{M}}$ contains all encountered beliefs. To guarantee termination
of the algorithm, the decision heuristic exploreBelief has to stop exploring
further beliefs at some point. Moreover, the handling of clipping candidates in
Line 19 should not add new beliefs to $Q$ infinitely often. We therefore fix a finite
set of candidate beliefs $\mathcal{B}^{\#} \subseteq B_{\mathcal{M}}$ and make sure that the candidate sets $\mathcal{B}$ in
Line 14 satisfy $(\mathcal{B} \setminus S^{\mathcal{K}}) \subseteq \mathcal{B}^{\#}$. To ensure a certain progress in the exploration,
"clip-cycles", i.e. paths of the form $b_1\ \mathit{clip}\ \dots\ \mathit{clip}\ b_n\ \mathit{clip}\ b_1$, are avoided in $\mathcal{K}_{\mathcal{M}}$.
This can be done, e.g. by always expanding the candidate beliefs $\tilde{b} \in \mathcal{B}^{\#}$.
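The exploration loop can be sketched in Python when restricted to cut-offs, i.e. without the clipping steps of Lines 14 to 19. This is a simplified illustration under stated assumptions: the belief MDP is given through hypothetical callbacks `is_goal`, `successors`, and `cutoff_value`, the exploration set is a FIFO queue, and the expansion heuristic is a plain size threshold. The toy belief chain used to run it is made up.

```python
from collections import deque


def explore_with_cutoffs(b_init, is_goal, successors, cutoff_value, max_states):
    """Unfold a belief MDP breadth-first and cut off once `max_states` beliefs are known.

    successors(b) -> {action: [(prob, reward, b_next), ...]} with hashable beliefs.
    Returns {belief: {action: [(prob, reward, target), ...]}}.
    """
    CUT = "b_cut"
    transitions = {CUT: {"cut": [(1, 0, CUT)]}}  # self-loop at the extra goal state
    seen = {b_init, CUT}
    queue = deque([b_init])
    while queue:
        b = queue.popleft()
        if is_goal(b):
            transitions[b] = {"goal": [(1, 0, b)]}  # goal beliefs are not expanded
        elif len(seen) < max_states:  # size-based expansion heuristic
            transitions[b] = successors(b)
            for branches in transitions[b].values():
                for _, _, b_next in branches:
                    if b_next not in seen:
                        seen.add(b_next)
                        queue.append(b_next)
        else:  # cut-off: jump to b_cut and collect the under-approximative value
            transitions[b] = {"cut": [(1, cutoff_value(b), CUT)]}
    return transitions


def toy_successors(k):
    # Hypothetical infinite belief chain: from k, reach the goal or move on to k + 1.
    return {"a": [(0.5, 0, k + 1), (0.5, 1, "goal")]}


mdp = explore_with_cutoffs(0, lambda b: b == "goal", toy_successors, lambda b: 0, 5)
print(sorted(map(str, mdp)))
```

Although the toy chain is infinite, the loop terminates because the threshold eventually forces every remaining belief to be cut off. The resulting finite MDP can then be analysed with standard MDP techniques.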

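Expected total rewards until reaching the extended set of goal beliefs $G_{\mathit{cut}} :=
G_{B} \cup \{b_{\mathit{cut}}\}$ in $\mathcal{K}_{\mathcal{M}}$ under-approximate the values in the belief MDP:


Theorem 2. For all beliefs $b \in S^{\mathcal{K}} \setminus \{b_{\mathit{cut}}\}$ it holds that
$$\mathrm{ER}^{\max}_{\mathcal{K}_{\mathcal{M}},R^{\mathcal{K}}}(b \models \Diamond G_{\mathit{cut}}) \leq V^{*}(b) = \mathrm{ER}^{\max}_{\mathrm{bel}(\mathcal{M}),R^{B}}(b \models \Diamond G_{B}).$$


Corollary 1. $\mathrm{ER}^{\max}_{\mathcal{K}_{\mathcal{M}},R^{\mathcal{K}}}(\Diamond G_{\mathit{cut}}) \leq \mathrm{ER}^{\max}_{\mathcal{M},R}(\Diamond G)$.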


5 For example, Q can be implemented as a FIFO queue. 

6 The decision can be made for example by considering the size of the already explored 
state space such that the expansion is stopped if a size threshold has been reached. 
More involved decision heuristics are subject to further research. 
Table 1. Results for benchmark POMDPs with maximisation objective


Benchmark Data PRISM STORM 
Model 46$ S/Act/Z Cut-Off Cut-Off + Clipping Over- 
Only | n=2 | n=3 | n=4 | n=6 ||Approx. 
Drone 1226|| TO / MO] >0.79|> 0.79 x 0.94 
41 Prax 2954 «1s| 1360s| TO| TO| TO 
384 3.104! 3-104 
Drone 1226| TO / MO|| >0.86/>0.91/>0.92 «0.97 
42 Pmax 2954 «1s| 249s| 1902s) TO] TO 
761 2-10*} 2-104] 2-104 
Grid-av 17|| [0.21, 1.0]]] >0.86]>0.93]> 0.93]>0.93]>0.93]] <0.98 
4o Pmax 59 5.14s «1s| «1s| 1.77s| 3.63s| 13.9s 
4 n =6 238| 312| 472| 663| 1300 
Grid-av 17|| [0.21, 1.0]|| >0.82|>0.85|> 0.82|7 0.85 «X 0.99 
4-0.1 Pass 59 1.47s « 1s| 26.1s| 198s| 1913s TO 
: 4 n =3 238| 317| 461| 759 
Netw-p 2307] [557,557]|| > 537] > 537| > 537| > 537| > 537|] < 558 
2-8-20 Pax 3-10ł 2355s 2.3s| 98.5s| 320s| 651s| 2368s 
ars 4 5 5 5 5 
4909 n —10| 8107! 1-10°] 1-10? | 1-10?| 1-10 
Netw-p 2305| TO / MO|| 2769| 7 769 «819 
3-8-20 Pme 3-105 290s 6640s] TO] TO! TO 
2-104 1-109| 1.108 
Refuel 208]|[0.67, 0.72]]| > 0.67]> 0.67]> 0.67| 2 0.67|> 0.67]| «0.69 
06 Poux 565 4625s <1s|] 5.89s| 24.3s 92s| 2076s 
50 n =3 4576| 4834| 5204| 5603| 6135 
Refuel 470|| TO / MO|| >0.45|> 0.45 <0.51 
og Pmax 1431 «1s| 839s] TO} TO| TO 
66 2-104] 2-104 


4 Experimental Evaluation 


Implementation details. We integrated Algorithm 1 in the probabilistic model
checker STORM [23] as an extension of the POMDP verification framework
described in [8]. Inputs are a POMDP, encoded either explicitly or using an
extension of the PRISM language [37], and a property specification. Internally,
POMDPs and MDPs are represented using sparse matrices. The implementation
supports minimisation⁷ and maximisation of reachability probabilities, reach-avoid
probabilities (i.e. the probability to avoid a set of bad states until a set of goal
states is reached), and expected total rewards. In a preprocessing step, functions $V$
and $L$ as considered in Algorithm 1 are generated. For $V$, we consider the function
$\mathcal{U}^{\sigma}$ as in Lemma 1, where $\sigma$ is a memoryless observation-based policy given by a
heuristic⁸. For the function $L$, we apply standard MDP analysis on the underlying
MDP. When exploring the abstraction MDP $\mathcal{K}_{\mathcal{M}}$, our heuristic expands a belief iff
$|S^{\mathcal{K}}| < |S| \cdot \max_{z \in Z} |O^{-1}(z)|$, where $|S^{\mathcal{K}}|$ is the number of already explored beliefs
and $|O^{-1}(z)|$ is the number of POMDP states with observation $z$. Belief clipping
can either be disabled entirely, or we consider candidate sets $\mathcal{B} \subseteq \mathcal{B}^{\#}_{\eta}$, where
$\mathcal{B}^{\#}_{\eta} := \{b \in B_{\mathcal{M}} \mid \forall s \in S\colon b(s) \in \{i/\eta \mid i \in \mathbb{N}, 0 \leq i \leq \eta\}\}$ forms a finite, regular grid
of beliefs with resolution $\eta \in \mathbb{N} \setminus \{0\}$. Grid beliefs $b \in \mathcal{B}^{\#}_{\eta}$ are always expanded.
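The regular grid of candidate beliefs can be enumerated as in the following sketch. It assumes that the given states all share one observation (otherwise the results are not beliefs in the sense of Sect. 2.2). The function name and the dictionary encoding are hypothetical and do not reflect the STORM implementation.

```python
from fractions import Fraction


def grid_beliefs(states, resolution):
    """Yield all beliefs over `states` whose entries are multiples of 1/resolution."""

    def compositions(total, parts):
        # All ways to write `total` as an ordered sum of `parts` non-negative integers.
        if parts == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, parts - 1):
                yield (first,) + rest

    for counts in compositions(resolution, len(states)):
        # Only states with positive mass are stored, i.e. the support of the belief.
        yield {s: Fraction(c, resolution) for s, c in zip(states, counts) if c > 0}


grid = list(grid_beliefs(["s0", "s1"], 2))
print(grid)
print(len(list(grid_beliefs(["s0", "s1", "s2"], 4))))
```

The number of grid beliefs over $k$ states with resolution $\eta$ is the binomial coefficient $\binom{\eta + k - 1}{k - 1}$, so candidate sets grow quickly with the resolution and with the number of states per observation.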


7 For minimisation, the under-approximation yields upper bounds.
8 The heuristic uses optimal values obtained on the fully observable underlying MDP.
Table 2. Results for benchmark POMDPs with minimisation objective


Benchmark Data PRISM STORM 
Model 6$ ||S/Act/Z Cut-Off Cut-Off 4- Clipping Over- 
Only | n=2 | n=3 | n=4 | n=6 || Approx. 
Grid 17|| [4.52,4.7]| <4.78]< 4.78| € A.78| € 4.78 24.52 
4.0.1 Rmin 62 649s <1s} 15.6s| 148s] 1940s] TO 
' 3 n 210 258| 255| 255| 255 
Grid 17||[6.12, 6.31]|| <6.56|<6.56|<6.56|< 6.56 > 6.08 
4.0.3 min 62 1077s «1s| 15.8s| 148s] 1983s| TO 
3 n =10 255| 256| 256| 256 
Maze2 15|[[6.32, 6.32]|| €6.34[€ 6.34|€ 6.34|€ 6.34|€ 6.34|| > 6.32 
0.1 Rmin 54 1.79s -1s| «1s -1s| «1s| 2.02s 
8 n 210 91 90 90 90 90 
Netw 4589] [3.17, 3.2]]| «6.56|€ 6.56|€ 6.56/<6.56/<6.56]| 73.14 
2-8-20 min 6973 211s < Is] 5.31s 17.25 42.3s 167s 
1173 n —10|| 2-107] 2-107] 2-10*} 3-104] 3-10 
Netw 2-107]|[5.61, 6.79] || «11.9|€ 11.9|€ 11.9| € 11.9 26.13 
3.8.20 Pin 3-104 7133s| 3.518! 214s] 13725 4910s| TO 
2205 9 —6| 110?| 2-10°] 2.105 2.105 
Rocks 6553 «38| <387) x38| <20| <21 >20 
jo Rmin 3-10ł|| TO / MO|| 1.39s| 61.1s| 138s| 230s| 532s 
1645 3-10*] 3-10ł| 3-104] 5-10*| 6-104 
Rocks 1-107 «44| «44| «44| «26| <27 >26 
je Pmi 5.104|| TO / MO|| 3.85s| 114s| 230s| 399s| 1062s 
2761 4-10*! 4-104] 4-104] 6-10*| 1.10? 


Furthermore, we exclude clipping candidates $\tilde{b}$ with $\delta_{b \to \tilde{b}}(s) > 0$ for $s$ with
$L(s) = -\infty$; clipping with such candidates is not useful as it induces a value of $-\infty$.
Expected total rewards on fully observable MDPs are computed using Sound Value
Iteration [39] with relative precision $10^{-6}$. MILPs are solved using GUROBI [21].


Set-up. We evaluate our under-approximation approach with cut-offs only and
with enabled belief clipping procedure using grid resolutions $\eta = 2, 3, 4, 6$. We
consider the same POMDP benchmarks⁹ as in [37,8]. The POMDPs are scalable
versions of case studies stemming from various application domains. To establish
an external baseline, we compare with the approach of [37] implemented in
PRISM [31]. PRISM generates an under-approximation based on an optimal policy
for an over-approximative MDP which, in contrast to STORM, means that
both under- and over-approximations always have to be computed. We ran PRISM with
resolutions $\eta = 2, 3, 4, 6, 8, 10$ and report on the best approximation obtained.
To provide a further reference for the tightness of our under-approximation,
we compute over-approximative bounds as in [8] using the implementation in
STORM with a resolution of $\eta = 8$. All experiments were run on an Intel® Xeon®
Platinum 8160 CPU using 4 threads¹⁰, 64GB RAM and a time limit of 2 hours.


Results. Tables 1 and 2 show our results for maximising and minimising properties,
respectively. The first columns contain for each POMDP the benchmark name,
model parameters, property type (probabilities (P) or rewards (R)), and the
numbers of states, state-action pairs, and observations. Column PRISM gives the
result with the smallest gap between over- and under-approximation computed
with the approach of [37]. For maximising (minimising) properties, our approach
competes with the lower (upper) bound of the provided interval. The relevant
value is marked in bold. We also provide the computation time and the considered
resolution $\eta$. For our implementation, we give results for the configuration with
disabled clipping and for clipping with different resolutions $\eta$. In each cell, we
give the obtained value, the computation time and the number of states in the
abstraction MDP $\mathcal{K}_{\mathcal{M}}$. Time- and memory-outs are indicated by TO and MO.
The right-most column indicates the over-approximation value computed via [8].


9 Instances with a finite belief MDP that would be fully explored by our algorithm are
omitted since the exact value can be obtained without approximation techniques.

10 For our implementation, only GUROBI runs multi-threaded. PRISM uses multiple
threads for garbage collection.


Figure 5. Accuracy for Drone 4-2 with different sizes of approximation MDP $\mathcal{K}_{\mathcal{M}}$


Discussion. The pure cut-off approach yields valid under-approximations in all 
benchmark instances—often exceeding the accuracy of the approach of [37] while 
being consistently faster. In some cases, the resulting values improve when clipping 
is enabled. However, larger candidate sets significantly increase the computation 
time which stems from the fact that many clipping MILPs have to be solved. 

For Drone 4-2, Fig. 5 plots the resulting under-approximation values (y-axis)
for varying sizes of the explored MDP $\mathcal{K}_{\mathcal{M}}$ (x-axis). The horizontal, dashed line indicates
the computed over-approximation value. The quality of the approximation
further improves with an increased number of explored beliefs.


5 Conclusion 


We presented techniques to safely under-approximate expected total rewards in 
POMDPs. The approach scales to large POMDPs and often produces tight lower 
bounds. Belief clipping generally does not improve on the simpler cut-off approach 
in terms of results and performance. However, considering—and optimising—the 
approach for particular classes of POMDPs might prove beneficial. Future work 
includes integrating the algorithm into a refinement loop that also considers 
over-approximation techniques from [8]. Furthermore, lifting our approach to 
partially observable stochastic games is promising. 


Data Availability. The artifact [9] accompanying this paper contains source code, 
benchmark files, and replication scripts for our experiments. 
Abstract. Probabilistic model checking computes probabilities and
expected values related to designated behaviours of interest in Markov
models. As a formal verification approach, it is applied to critical sys- 
tems; thus we trust that probabilistic model checkers deliver correct re- 
sults. To achieve scalability and performance, however, these tools use 
finite-precision floating-point numbers to represent and calculate prob- 
abilities and other values. As a consequence, their results are affected 
by rounding errors that may accumulate and interact in hard-to-predict 
ways. In this paper, we show how to implement fast and correct prob- 
abilistic model checking by exploiting the ability of current hardware 
to control the direction of rounding in floating-point calculations. We 
outline the complications in achieving correct rounding from higher- 
level programming languages, describe our implementation as part of 
the MODEST TOOLSET’s mcsta model checker, and exemplify the trade- 
offs between performance and correctness in an extensive experimental 
evaluation across different operating systems and CPU architectures. 


1 Introduction 


Given a Markov chain or Markov decision process (MDP [25]) model of a safety- 
or performance-critical system, probabilistic model checking (PMC) calculates 
quantitative properties of interest: the probability of (rare or catastrophic) fail- 
ures, the expected recovery time after service interruption, or the long-run aver- 
age throughput. These properties involve probabilities or expected costs/rewards 
of sets of model behaviours, and are often specified in a temporal logic like 
PCTL [16]. As a formal verification approach, users place great trust in the 
results delivered by a PMC tool such as PRISM [22], STORM [9], ePMC [15], 
or the MODEST TOOLSET’s [18] mcsta. In contrast to classical model checkers 
for functional, Boolean-valued properties specified in e.g. LTL or CTL [2], a 
probabilistic model checker is inherently quantitative: the input model contains 
real-valued probabilities and costs/rewards; PCTL makes comparisons between 
real-valued constants and probabilities; the most efficient algorithms numerically 
iterate towards a fixpoint; and the final result itself may well be a real number. 
Often, we can restrict to rationals, which simplifies the theory and facilitates
“exact” algorithms using arbitrary-precision rational number datatypes. These 
algorithms only work for small models (as shown in the most recent QComp 
2020 competition of quantitative verification tools [6]). In this paper, we thus 
focus on the PMC techniques that scale to large problems: those building upon 
iterative numerical algorithms, in particular value iteration (VI) [8]. We restrict 
to probabilistic reachability, i.e. calculating the probability to eventually reach a 
goal state, as this is the core problem in PMC for MDP. Embedded in the usual 
recursive CTL algorithm, it allows us to check any (unbounded) PCTL formula. 

Starting from a trivial underapproximation of the reachability probability 
for each state of the model, VI iteratively improves the value of each state 
based on its successors' values. The true reachability probabilities are the least
fixpoint of this procedure, towards which the algorithm converges. For roughly
a decade, PMC tools implemented VI by stopping once the relative or absolute
difference between subsequent iterations was below a threshold $\varepsilon$. Haddad and
Monmege [12] showed in 2014¹ that this does not guarantee a difference of $\le \varepsilon$
between the reported and the true probability, putting in question the trust
placed in PMC tools. Then variants of VI were developed that provide sound, i.e.
$\varepsilon$-correct, results: interval iteration (II) [3,5,13], sound value iteration (SVI) [26],
and optimistic value iteration (OVI) [19]. We focus on II as the prototypical
sound algorithm. It additionally iterates on an overapproximation; its stopping
criterion is the difference between over- and underapproximation being $\le \varepsilon$.

If all probabilities in an MDP are rational numbers, then the true reachability 
probability as well as all intermediate values in II are rational, too. Yet imple- 
menting II with arbitrary-precision rationals is impractical since the smaller- 
and-smaller differences between intermediate values end up using excessive com- 
putation time and memory. II is thus implemented with fixed-precision (usually 
64-bit IEEE 754 double precision) floating point numbers. These, however, can- 
not represent all rationals, so operations must round to nearby representable 
values. Although II is numerically benign, consisting only of multiplications and 
additions within [0, 1], the default round to nearest, ties to even policy can cause 
II to deliver incorrect results. Wimmer et al. [29] show an example where PMC 
tools incorrectly state that a simple PCTL property is satisfied by a small Markov 
chain due to the underlying numeric difference having disappeared in rounding. 
We confirmed with current versions of PRISM, STORM, and mcsta that the prob- 
lem persists to today, even when requesting a "sound" algorithm like II. Wimmer 
et al. propose interval arithmetic to avoid such problems, cautioning that 


[...] the memory consumption will roughly double, since two numbers for 
the interval bounds have to be stored [...]. The runtime will be higher by 
a small factor, because we need to derive lower and upper bounds for the 
intervals, requiring two model checking runs per sub-formula. [29, p. 5]


They did not provide an implementation, and we are not aware of any to date.


¹ Wimmer et al. [29] already in 2008 mention this problem in a more general setting,
but neither give a concrete counterexample nor propose a solution tailored to PMC. 
Our contribution. We present the first PMC implementation that computes correct
lower and upper bounds on reachability probabilities despite using floating-point
arithmetic. We benefit from two developments since Wimmer et al.'s paper
of 2008: First, II (published 2014) already uses intervals (though not as Wimmer 
et al. envisioned), necessarily doubling memory consumption compared to VI (as 
do SVI and OVI, so it appears an unavoidable cost of soundness). In place of 
"two model checking runs per sub-formula", we can make the two interleaved 
computations inside II safe w.r.t. rounding. Second, hardware and programming 
language support for controlling the rounding direction in floating-point opera- 
tions has improved, in particular with the AVX-512 instruction set in the newest 
x86-64 CPUs and widespread compiler support for C99's "floating-point environment"
header fenv.h. Nevertheless, it is nontrivial to achieve runtime that is
only "higher by a small factor". For the analysis of probabilistic systems, the only
related use of safe rounding we are aware of is in the SSMT tool SiSAT [27]. 


Structure. We recap PMC and II (Sect. 2) as well as problems and solutions re- 
lated to rounding in floating-point arithmetic in Sect. 3. We then present our new 
approach in Sect. 4, including important implementation aspects. The perfor- 
mance of our approach is crucial to its adoption in tools; thus in Sect. 5 we report 
on extensive experiments across different software and hardware configurations 
on models from the Quantitative Verification Benchmark Set (QVBS) [20]. 


2 Probabilistic Model Checking 


We write $\{ x_1 \mapsto y_1, \ldots \}$ to denote the function that maps all $x_i$ to $y_i$. Given a
set $S$, its powerset is $2^S$. A (discrete) probability distribution over $S$ is a function
$\mu \in S \to [0, 1]$ with countable support $\mathit{spt}(\mu) = \{ s \in S \mid \mu(s) > 0 \}$ and
$\sum_{s \in \mathit{spt}(\mu)} \mu(s) = 1$. $\mathit{Dist}(S)$ is the set of all probability distributions over $S$. If
$\mu(s) \in \mathbb{Q}$ for all $s \in S$, we call $\mu$ a rational probability distribution, in $\mathit{Dist}_{\mathbb{Q}}(S)$.


Markov decision processes (MDP) [25] combine the nondeterminism of Kripke 
structures with the finite random choices of discrete-time Markov chains (DTMC). 


Definition 1. A Markov decision process (MDP) is a triple $M = (S, s_I, T)$
where $S$ is a finite set of states with initial state $s_I \in S$ and $T\colon S \to 2^{\mathit{Dist}_{\mathbb{Q}}(S)}$
is the transition function. $T(s)$ must be finite and non-empty for all $s \in S$.


For $s \in S$, an element $\mu$ of $T(s)$ is a transition, and if $s' \in \mathit{spt}(\mu)$, then the
transition has a branch to successor state $s'$ with probability $\mu(s')$. If $|T(s)| = 1$
for all $s \in S$, then M is a DTMC.


Example 1. Fig. 1 shows our example MDP M7, which is actually a DTMC. It
is a simplified and parametrised version of the counterexample of Wimmer et
al. [29, Fig. 2]. It is parametrised in terms of $n \in \mathbb{N}$ (determining the number
of chained states with transitions labelled b) and $y \in (0, 0.5)$ (changing some
probabilities). We draw transitions as lines to an intermediate node from which
probability-labelled branches lead to successor states. We omit the intermediate
node for transitions with a single branch, and label some transitions to easily
refer to them. M7 has $4 + n$ states and transitions, and $7 + 2n$ branches.


Fig. 1. Example parametrised MDP M7


In practice, higher-level modelling languages like MODEST [14] are used to specify 
MDP. The semantics of an MDP is captured by its paths. A path represents a 
concrete resolution of all nondeterministic and probabilistic choices. Formally: 


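Definition 2. A finite path is a sequence $\pi_{\mathrm{fin}} = s_0\, \mu_0\, s_1\, \mu_1 \ldots \mu_{n-1}\, s_n$ where
$s_i \in S$ for all $i \in \{ 0, \ldots, n \}$ and $\mu_i \in T(s_i) \wedge \mu_i(s_{i+1}) > 0$ for all $i \in \{ 0, \ldots, n - 1 \}$.
Let $|\pi_{\mathrm{fin}}| = n$ and $\mathit{last}(\pi_{\mathrm{fin}}) = s_n$. $\Pi_{\mathrm{fin}}(s)$ is the set of all finite paths starting
in $s$. A path is an analogous infinite sequence $\pi$, and $\Pi(s)$ is the set of all paths
starting in $s$. We write $s \in \pi$ if $\exists\, i\colon s = s_i$.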


A scheduler (or adversary, policy or strategy) only resolves the nondeterministic 
choices of M. For this paper, memoryless deterministic schedulers suffice [4]. 


Definition 3. A function $\mathfrak{s}\colon S \to \mathit{Dist}(S)$ is a scheduler if, for all $s \in S$, we
have $\mathfrak{s}(s) \in T(s)$. The set of all schedulers of M is $\mathfrak{S}(M)$.


We are interested in reachability probabilities. Let $M|_{\mathfrak{s}} = (S, s_I, T|_{\mathfrak{s}})$ with $T|_{\mathfrak{s}}(s) =
\{ \mathfrak{s}(s) \}$ be the DTMC induced by $\mathfrak{s}$ on M. Via the standard cylinder set construction
[10, Sect. 2.2] on $M|_{\mathfrak{s}}$, a scheduler induces probability measures $P^{M,s}_{\mathfrak{s}}$
on measurable sets of paths starting in $s \in S$.


Definition 4. For state s and goal state $g \in S$, the maximum and minimum
probability of reaching g from s is defined as
$P^{M,s}_{\max}(\diamond\, g) = \sup_{\mathfrak{s} \in \mathfrak{S}(M)} P^{M,s}_{\mathfrak{s}}(\{ \pi \in \Pi(s) \mid g \in \pi \})$ and
$P^{M,s}_{\min}(\diamond\, g) = \inf_{\mathfrak{s} \in \mathfrak{S}(M)} P^{M,s}_{\mathfrak{s}}(\{ \pi \in \Pi(s) \mid g \in \pi \})$, respectively.


The definition extends to sets G of goal states. We omit the superscript for M
when it is clear from the context, and if we omit that for s, then $s = s_I$. From
now on, whenever we have an MDP with a set of goal states G, we assume w.l.o.g.
that all $g \in G$ are absorbing, i.e. every g only has one self-loop transition.


Definition 5. A maximal end component (MEC) of M is a maximal (sub-)MDP
$(S', T', s_I')$ where $S' \subseteq S$, $T'(s) \subseteq T(s)$ for all $s \in S'$, and the directed graph with
vertex set $S'$ and edge set $\{ (s, s') \mid \exists\, \mu \in T'(s)\colon \mu(s') > 0 \}$ is strongly connected.
1  function II($M = (S, s_I, T)$, G, opt, $\varepsilon$)
     // Preprocessing
2    if opt = max then M := CollapseMECs(M, G)  // collapse MECs
3    $S_0$ := Prob0(M, G, opt), $S_1$ := Prob1(M, G, opt)  // identify 0/1 states
4    $l := \{ s \mapsto 0 \mid s \in S \setminus S_1 \} \cup \{ s \mapsto 1 \mid s \in S_1 \}$  // initialise lower vector
5    $u := \{ s \mapsto 0 \mid s \in S_0 \} \cup \{ s \mapsto 1 \mid s \in S \setminus S_0 \}$  // initialise upper vector
     // Iteration
6    while $(u(s_I) - l(s_I)) / l(s_I) > \varepsilon$ do  // while relative error > $\varepsilon$:
7      foreach $s \in S \setminus (S_0 \cup S_1)$ do  // update non-0/1 states:
8        $l(s) := \mathrm{opt}_{\mu \in T(s)} \sum_{s' \in \mathit{spt}(\mu)} \mu(s') \cdot l(s')$  // iterate lower vector
9        $u(s) := \mathrm{opt}_{\mu \in T(s)} \sum_{s' \in \mathit{spt}(\mu)} \mu(s') \cdot u(s')$  // iterate upper vector
10   return $\tfrac{1}{2} (u(s_I) + l(s_I))$


Alg. 1: Interval iteration for probabilistic reachability 


2.1 Algorithms 


Interval iteration [3,5,12,13] computes reachability probabilities $p(s) = P^{s}_{\mathrm{opt}}(\diamond\, G)$,
$\mathrm{opt} \in \{ \max, \min \}$. We show the basic algorithm as Alg. 1. It iteratively refines
vectors l and u that map each state to a value in $\mathbb{Q}$ such that, at all times, we
have $l(s) \le p(s) \le u(s)$. In each iteration, the values in l and u are updated
for all relevant states (line 7) via the classic Bellman equations of value iteration
(lines 8-9). Their least fixpoint is p, towards which l converges from below.
Some preprocessing is needed to ensure that the fixpoint is unique and also u
converges towards p: for maximisation, we need to collapse MECs into single
states (line 2). This can be done via graph-based algorithms (see e.g. [7])
that only consider the graph structure of the MDP as in Definition 1 but do
not perform calculations with the concrete probability values. For both maximisation
and minimisation, we need to identify the sets $S_0$ and $S_1$ such that
$\forall s \in S_0\colon p(s) = 0$ and $\forall s \in S_1\colon p(s) = 1$ (line 3). This can equally be done
via graph-based algorithms [10, Algs. 1-4]. We then initialise l and u to trivial
under-/overapproximations of p (lines 4-5). Iteration stops when the relative
difference between l and u at $s_I$ is at most $\varepsilon$ (which is often chosen as 10^? or
$10^{-9}$). The corresponding check in line 6 assumes that division by zero results
in $+\infty$, as is the default in IEEE 754. By convergence of l and u towards the
fixpoint, II terminates, and we eventually return a value $\hat{p}$ with the guarantee
that $p(s_I) \in [(1 - \varepsilon) \cdot \hat{p}, (1 + \varepsilon) \cdot \hat{p}]$. This makes II sound.
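
To make the data flow of Alg. 1 concrete, the following C sketch implements its iteration phase with the default rounding mode. It is an illustration under assumptions, not the code of any of the tools mentioned: the flat array layout, all identifiers, the fixed array standing in for the preprocessing results $S_0$ and $S_1$, and the max_iter safeguard are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* MDP in a flat, CSR-like layout (illustrative, not taken from any tool). */
typedef struct {
    size_t num_states;
    const size_t *trans_start;  /* transitions of s: [trans_start[s], trans_start[s+1]) */
    const size_t *branch_start; /* branches of t: [branch_start[t], branch_start[t+1]) */
    const size_t *target;       /* successor state of each branch */
    const double *prob;         /* probability of each branch */
} Mdp;

/* One Bellman update for state s on value vector v (lines 8-9 of Alg. 1). */
static double bellman(const Mdp *m, size_t s, const double *v, bool maximise) {
    double best = 0.0;
    for (size_t t = m->trans_start[s]; t < m->trans_start[s + 1]; ++t) {
        double sum = 0.0;
        for (size_t b = m->branch_start[t]; b < m->branch_start[t + 1]; ++b)
            sum += m->prob[b] * v[m->target[b]];
        bool first = (t == m->trans_start[s]);
        if (first || (maximise ? sum > best : sum < best)) best = sum;
    }
    return best;
}

/* Interval iteration with default rounding. fixed[s] is 0 for states in S_0,
   1 for states in S_1, and -1 otherwise; preprocessing is assumed to be done.
   max_iter is a hypothetical safeguard against non-termination. */
double interval_iteration(const Mdp *m, size_t init, const int *fixed,
                          double *l, double *u, bool maximise,
                          double eps, unsigned long max_iter) {
    assert(m != NULL && eps > 0.0);
    for (size_t s = 0; s < m->num_states; ++s) {
        l[s] = (fixed[s] == 1) ? 1.0 : 0.0;  /* line 4 */
        u[s] = (fixed[s] == 0) ? 0.0 : 1.0;  /* line 5 */
    }
    /* Division by zero yields +infinity, so we keep iterating while l is 0. */
    for (unsigned long i = 0;
         i < max_iter && (u[init] - l[init]) / l[init] > eps; ++i) {
        for (size_t s = 0; s < m->num_states; ++s) {
            if (fixed[s] != -1) continue;
            l[s] = bellman(m, s, l, maximise);
            u[s] = bellman(m, s, u, maximise);
        }
    }
    return 0.5 * (u[init] + l[init]);        /* line 10: middle of the interval */
}

/* Example DTMC: state 0 moves to goal state 1 with probability 1/4, stays
   with 1/2, and moves to sink state 2 with 1/4. The true probability is 1/2. */
double example_dtmc_result(double eps) {
    static const size_t trans_start[] = {0, 1, 2, 3};
    static const size_t branch_start[] = {0, 3, 4, 5};
    static const size_t target[] = {1, 0, 2, 1, 2};
    static const double prob[] = {0.25, 0.5, 0.25, 1.0, 1.0};
    static const int fixed[] = {-1, 1, 0};
    Mdp m = {3, trans_start, branch_start, target, prob};
    double l[3], u[3];
    return interval_iteration(&m, 0, fixed, l, u, true, eps, 1000000UL);
}
```

Calling example_dtmc_result with a small relative error returns a value close to 0.5. All probabilities in this example are powers of two, so no rounding occurs here; the rounding problems discussed later do not show up in it.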


PCTL. The temporal logic PCTL [16] allows us to construct complex branching-time
properties. It takes standard CTL [2] and replaces the A($\psi$) ("for all paths
$\psi$ holds") and E($\psi$) ("there exists a path for which $\psi$ holds") operators by the
probabilistic operator $P_{\sim c}(\psi)$ for "under all schedulers, the probability of the
measurable set of paths for which $\psi$ holds is $\sim c$" where $\sim\, \in \{ <, \le, >, \ge \}$
and $c \in [0, 1]$. To model-check a PCTL formula on MDP M, we follow the
standard recursive CTL model checking algorithm [2, Sect. 6.4] except for the P
operator, which can be reduced to computing reachability probabilities. For the
"finally"/"eventually" case $P_{\sim c}(\mathsf{F}\, \phi)$, we can directly use interval iteration: Let $S_\phi$
be the set of states recursively determined to satisfy $\phi$. Call $\mathrm{II}(M, S_\phi, \mathrm{opt}_\sim, \varepsilon)$
of Alg. 1 with $\mathrm{opt}_\sim = \max$ if $\sim\, \in \{ <, \le \}$ and $\mathrm{opt}_\sim = \min$ otherwise, with two
modifications: Change the stopping criterion of line 6 to check the difference for
all states, and in line 10, return the set $S_P = \{ s \in S \mid \forall x \in [l(s), u(s)]\colon x \sim c \}$. If,
for some state $s \in S$, the interval $[l(s), u(s)]$ contains values x with $x \sim c$ as well as
values with $x \not\sim c$, however, we would need to either abort and report
an "unknown" situation, or continue with a reduced $\varepsilon$ until we can (hopefully
eventually) decide the comparison. None of PRISM, STORM, and mcsta appear
to perform this extra check, though. In this paper, we only use PCTL for non-nested
top-level $P_{\sim c}(\mathsf{F} \ldots)$ operators; the results are then true if $s_I \in S_P$, should be
unknown in case the "unknown" situation applies to $s_I$, and are false otherwise.
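
The three-valued outcome for a top-level operator can be sketched in C as follows. This is a hypothetical helper, not part of any of the tools mentioned; the type and function names are assumptions. Since each of the four comparisons is a threshold test, it suffices to evaluate it at both interval endpoints.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { CMP_LT, CMP_LE, CMP_GT, CMP_GE } Comparison;
typedef enum { VERDICT_FALSE, VERDICT_TRUE, VERDICT_UNKNOWN } Verdict;

/* Evaluates x ~ c for a single value x. */
static bool holds(double x, Comparison cmp, double c) {
    switch (cmp) {
        case CMP_LT: return x < c;
        case CMP_LE: return x <= c;
        case CMP_GT: return x > c;
        default:     return x >= c;
    }
}

/* Decides x ~ c for all x in [lo, hi]. If the comparison holds at both
   endpoints, it holds on the whole interval; if it fails at both, it fails
   everywhere; otherwise the interval straddles c and the answer is unknown. */
Verdict decide(double lo, double hi, Comparison cmp, double c) {
    assert(lo <= hi);
    bool at_lo = holds(lo, cmp, c);
    bool at_hi = holds(hi, cmp, c);
    if (at_lo && at_hi) return VERDICT_TRUE;
    if (!at_lo && !at_hi) return VERDICT_FALSE;
    return VERDICT_UNKNOWN;
}
```

For instance, decide(0.4, 0.6, CMP_LE, 0.5) yields VERDICT_UNKNOWN. The verdict is only as trustworthy as the interval passed in.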


3 Floating-Point Arithmetic 


The current implementations of II (in PRISM, STORM, and mcsta) use IEEE 754 
double-precision floating-point arithmetic to represent (a) the probabilities of 
the MDP's branches and (b) the values in l and u. A floating-point number is 
stored as a significand d and an exponent e w.r.t. an agreed-upon base b such
that it represents the value $d \cdot b^e$. We fix $b = 2$. IEEE 754 double precision uses
64 bits in total, of which 1 is a sign bit, 52 are for d, and 11 are for e. Standard 
alternatives are 32-bit single precision (1 sign, 23 bits for d, and 8 for e) and the 
80-bit x87 extended precision format (with 1 sign bit, 64 for d, and 15 for e). The 
subset of $\mathbb{Q}$ that can be represented in such a format is determined by the
numbers of bits for d and e. For example, $\tfrac{1}{2}$ or $\tfrac{3}{4}$ can be represented exactly in all
formats, but $\tfrac{1}{10}$ cannot. IEEE 754 prescribes that all basic operations (addition,
multiplication, etc.) are performed at "infinite precision" with the result rounded
to a representable number. The default rounding mode is to round to the nearest
such number, choosing an even value in case of ties (round to nearest, ties to
even). In single precision, $\tfrac{1}{10}$ is thus by default rounded to


$13421773 \cdot 2^{-27} = 0.100000001490116119384765625$.
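
The single-precision value above can be checked with a few lines of C. The function names are illustrative assumptions; the check relies on float being the IEEE 754 single-precision format, which the first assertion makes explicit.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>

/* Returns true if the single-precision constant 0.1f equals 13421773 * 2^-27.
   Both sides are exactly representable in double precision, so the
   comparison itself introduces no further rounding. */
bool single_tenth_matches_text(void) {
    assert(FLT_MANT_DIG == 24);                  /* IEEE 754 single precision */
    volatile float tenth = 0.1f;                 /* rounded to nearest */
    double stored = (double)tenth;               /* exact widening conversion */
    double expected = 13421773.0 / 134217728.0;  /* 2^27 = 134217728 */
    return stored == expected;
}

/* Returns true if the stored value lies above the real number 1/10, checked
   in exact integer arithmetic: 10 * 13421773 > 2^27. */
bool single_tenth_exceeds_real(void) {
    return 10L * 13421773L > 134217728L;
}
```

Both functions return true on an IEEE 754 platform, so round to nearest happens to round $\tfrac{1}{10}$ upwards in single precision.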


A single rounded operation leads to an error of at most the distance between the 
two nearest representable numbers. In iterative computations, however, rounding 
may happen at every step. A striking example of the consequences is the failure of
an American Patriot missile battery to intercept an incoming Iraqi Scud missile
in February 1991 in Dhahran, Saudi Arabia [28], which resulted in 28 fatalities.
The Patriot system calculated time in seconds by multiplying its internal clock's
value by a rounded binary representation of $\tfrac{1}{10}$. After 100 hours of continuous
operation, this led to a cumulative rounding error large enough to miscalculate
the incoming missile's position by more than half a kilometre [1].


3.1 Errors in Probabilistic Model Checking 


II accumulates and multiplies rounded floating-point values in the l and u vectors 
with potentially already-rounded values representing the rational probabilities of 
the model. Using the default rounding mode, how can we be sure that the final
result does not miss the true probability by more than half a kilometre, too? 

Following Wimmer et al. [29], let us consider MDP M7 of Fig. 1 again, and
determine whether $P_{\le \frac{1}{2}}(\diamond\, \{ s_+ \})$ holds. The model is acyclic, so it is easy to see
that $p = P_{\max}(\diamond\, \{ s_+ \})$ exceeds $\tfrac{1}{2}$ by a small term determined by n and y.
Let us fix $n = 1$ and $y = 10^{-9}$. Then $p = \tfrac{1}{2} + 10^{-18}$. This value cannot be
represented in double precision, and is by default rounded to 0.5.
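
The loss of the small term is easy to reproduce. The following C sketch (the function name is an assumption) adds a small term to 0.5 in double precision under the default rounding mode; the volatile qualifiers only serve to keep the compiler from evaluating the sum at compile time or in a wider format.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true if adding tiny to 0.5 in double precision gives exactly 0.5
   again, i.e. if the term disappears in rounding. */
bool tiny_term_is_absorbed(double tiny) {
    assert(tiny >= 0.0);
    volatile double half = 0.5;
    volatile double sum = half + tiny;  /* round to nearest, ties to even */
    return sum == half;
}
```

With tiny = 1e-18 the function returns true, because the distance from 0.5 to the next larger double is $2^{-53} \approx 1.1 \cdot 10^{-16}$. With tiny = 1e-15 it returns false.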

We have encoded M7 in the MODEST and PRISM languages, and checked the 
answers returned by PRISM 4.7, STORM 1.6.4, and mcsta 3.1 for the property. 
The correct result would be false. PRISM returns true in its default configuration, 
which uses an unsound algorithm, and false when requesting an algorithm with 
exact rational arithmetic, for which M7 is small enough. If we explicitly request 
PRISM to use II, then the result depends on the specified $\varepsilon$: for $\varepsilon \ge 10^{-11}$, we get
the correct result of false; for smaller $\varepsilon \le 10^{-12}$, i.e. higher precision, however, we
incorrectly get true. STORM incorrectly returns true in its default configuration
as well as when we request a sound algorithm via the --sound parameter. Only
when using an exact rational algorithm via the --exact parameter does STORM
correctly return false. mcsta, when using II (--alg IntervalIteration), incorrectly
returns true, and additionally reports that it computed $[l(s_I), u(s_I)]$
as [0.5, 0.5], thus not including the true value of p. Other algorithms are not
immune to the problem, either; for example, mcsta also answers true when using 
SVI, OVI, and when solving the MDP as a linear programming problem via the 
Google OR Tools’ GLOP LP solver. 

This example shows that using a sound algorithm does not guarantee correct
results. The problem is not specific to cases of small probabilities like $y = 10^{-9}$
in the MDP; we can achieve the same effect using arbitrarily higher values of y 
if we just increase n a little. Such bounded try-and-retry chains where “normal” 
probabilities in the model result in very small values during iteration and on 
the final result are not uncommon in the systems often modelled as MDPs, 
e.g. backoff schemes in communication protocols and randomised algorithms. In 
general, tiny differences in probabilities in one place may result in significant 
changes of the overall reachability probability; for example, in two-dimensional 
random walks, the long-run behaviour when the probabilities to move forward
or backward are both $\tfrac{1}{2}$ is vastly different from when they are $\tfrac{1}{2} + \delta$ and $\tfrac{1}{2} - \delta$,
respectively, for any $\delta > 0$.


3.2 On Precision and Rounding Modes 


In our concrete example, we may be able to avoid the problem by increasing
precision: In the 80-bit extended format supported by all x86-64 CPUs, $\tfrac{1}{2} + 10^{-18}$
is by default rounded to $5.000000000000000009\ldots \cdot 10^{-1}$, so there is a chance
of obtaining false unless other rounding during iterations would lose all the
difference. Extended precision is used for C's long double type by e.g. the GCC
compiler; it is thus readily accessible to programmers. It is, however, the most
precise format supported in common CPUs today; if we need more precision,
we would have to resort to much slower software implementations using e.g. 
the GNU MPFR library. Any a-priori fixed precision, however, just shifts the 
problem to smaller differences, but does not eliminate it. 
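
Whether long double actually offers the extra precision depends on compiler and platform. The following C sketch (the function name and the return value convention are assumptions) checks for at least 64 significand bits before repeating the addition from Sect. 3.1 in long double.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if 1/2 + 10^-18 is distinguishable from 1/2 in long double,
   0 if the small term is lost, and -1 if long double has fewer than 64
   significand bits on this platform (e.g. where it is the same as double). */
int extended_keeps_tiny_term(void) {
#if LDBL_MANT_DIG >= 64
    volatile long double half = 0.5L;
    volatile long double tiny = 1e-18L;
    assert(tiny > 0.0L);
    volatile long double sum = half + tiny;  /* default rounding mode */
    return sum != half ? 1 : 0;
#else
    return -1;
#endif
}
```

With GCC on x86-64 Linux, where long double is the 80-bit x87 format, the small term survives this single addition. As stated above, this only shifts the problem to smaller differences.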

The more general solution that we propose in this paper is to control the 
rounding mode of the floating-point operations performed in the II algorithm. 
In addition to the default round to nearest, ties to even mode, the IEEE 754 
standard defines three directed rounding modes: round towards zero (i.e. truncation),
round towards $+\infty$ (i.e. always round up), and round towards $-\infty$ (i.e.
always round down). As we will explain in Sect. 4, using the latter two gives us an
easy way to make the computations inside II safe, i.e. guarantee the under- and
overapproximation invariants for l and u, respectively. Control of the floating-point
rounding mode however appears to be a rarely-used feature of IEEE 754
implementations; consequently the level and style of support for it in CPUs and 
high-level programming languages is diverse. 


3.3 CPU Support for Rounding Modes 


STORM and mcsta run exclusively on x86-64 systems (with the upcoming ARM- 
based systems so far only supported via their x86-64 emulation layers), while 
PRISM additionally supports several other platforms via manual compilation. 
Thus we focus on x86-64 in this paper as the platform probabilistic model checkers
overwhelmingly run on today.


X87 and SSE. All x86-64 CPUs support two instruction sets to perform floating-point
operations in double precision: The x87 instruction set, originating from
the 8087 floating-point coprocessor, and the SSE instruction set, which includes 
support for double precision since the Pentium 4's SSE2 extension. Both imple- 
ment operations according to the IEEE 754 standard. Aside from architectural 
particularities such as its stack-based approach to managing registers, the x87 
instruction set notably includes support for 80-bit extended precision. In fact, 
by default, it performs all calculations in that extended precision, only rounding 
to double or single precision when storing values back to 64- or 32-bit memory 
locations. This has the advantage of reducing the error across sequences of oper- 
ations, but for high-level languages makes the results depend on the compiler's 
choices of when to load/store intermediate values in memory vs. keeping them 
in x87 registers. The SSE instructions only support single and double precision. 

Both the x87 and SSE instruction sets support all four rounding modes men- 
tioned above. The rounding mode of operations for x87 and SSE is determined 
by the current value of the x87 FPU control word stored in the x87 FPU control 
register or the current value of the SSE MXCSR control register, respectively. 
That is, to change rounding mode, we need to obtain the current control regis- 
ter value, change the two bits determining rounding mode (with the other bits 
controlling other aspects of floating-point operations such as the treatment of 
NaNs), and apply the new value. This is done via the FNSTCW/FLDCW in- 
struction pair on x87, and VSTMXCSR/VLDMXCSR for SSE. Rounding mode 
is thus part of the global (per-thread) state, and we must be careful to restore
its original configuration when returning to code that does not expect rounding 
mode changes. Frequent changes of rounding mode thus incur a performance 
overhead due to the extra instructions that must be executed for every change 
and their effects on e.g. pipelining. 


AVX-512. AVX-512 is the extension to 512 bits of the sequence of single instruction,
multiple data (SIMD) instruction sets in x86-64 processors that started
with SSE. It became available for general-purpose systems in high-end desk- 
top (Skylake-X) and server (Xeon) CPUs in 2017, but it took until the 10th 
generation of Intel's Core mobile CPUs in 2019 before it was more widely avail- 
able in end-user systems. It is supposed to appear in AMD CPUs with the 
upcoming Zen 4 architecture. Aside from its 512-bit SIMD instructions, AVX- 
512 crucially also includes new instructions for single floating-point values where 
the operation's rounding mode is specified as part of the instruction itself via 
the new “EVEX” encoding. Of particular note for implementing II are the new 
VFMADD$r_1 r_2 r_3$SD fused multiply-add instructions (the $r_i$ determining how
the operand registers are used) that can directly be used for the sums of products
in the Bellman equations in lines 8-9 of Alg. 1. Overall, AVX-512 thus makes
rounding mode independent of global state, and may improve performance by 
removing the need for extra instruction sequences to change rounding mode. 


3.4 Rounding Modes in Programming Languages 


Support for non-default rounding modes is lacking in most high-level program- 
ming languages. Java, C#, and Python, for example, do not support them at 
all. If II is implemented in such a language, there is consequently no hope for a 
high-performance solution to the rounding problems described earlier. 

For C and C++, the C99 and C++11 standards introduced access to the 
floating-point environment. The fenv.h/cfenv headers include the fegetround 
and fesetround functions to query the current rounding mode and change it, 
respectively. Implementations of these functions on x86-64 read/change both the 
x87 and SSE control registers accordingly. In the remainder of this paper, we fo- 
cus on a C implementation, but most statements hold for C++ analogously. The 
level of support for the C99 floating-point features varies significantly between 
compilers; it is in particular still incomplete in Clang² and GCC [11, Further
notes]. Still, both compilers provide access to the fegetround/fesetround functions
(via the associated standard libraries), but GCC in particular is not rounding
mode-aware in optimisations. This means that, for example, subexpressions
that are evaluated twice, with a change in rounding mode in between, may be
compiled by GCC into a single evaluation before the change, with the resulting
value stored in a register and reused after the rounding mode change. This can
even happen when using the -frounding-math option³. Programmers thus need
to inspect the generated assembly to ensure that no problematic transformations
have been made, or try to make them impossible by declaring values volatile
or inserting inline assembly "barriers".


² The documentation as of October 2021 states that C99 support in Clang "is feature-
complete except for the C99 floating-point pragmas".

Overall, C thus provides a standardised way to change x87/SSE rounding 
mode, but programmers need to be aware of compiler quirks when using these 
facilities. Support for AVX-512 instructions that include rounding mode bits in 
C, on the other hand, is only slightly more convenient than programming in 
assembly as we can use the intrinsics in the immintrin.h header; there is no 
standard higher-level abstraction of this feature in either C or C++. 


4 Correctly Rounding Interval Iteration 


Let us now change II as in Alg. 1 to consistently round in safe directions at 
every numeric operation. Given that we can change or specify the rounding 
mode of all basic floating-point operations on current hardware, we expect that 
a high-performance implementation can be achieved. First, the preprocessing 
steps require no changes as they are purely graph-based. The changes to the 
iteration part of the algorithm are straightforward: In line 6, 


while $(u(s_I) - l(s_I)) / l(s_I) > \varepsilon$ do ...,

we round the results of the subtraction and of the division towards $+\infty$ to avoid
stopping too early. In line 8,


$l(s) := \mathrm{opt}_{\mu \in T(s)} \sum_{s' \in \mathit{spt}(\mu)} \mu(s') \cdot l(s')$,

the multiplications and additions round towards $-\infty$ while the corresponding
operations on the upper bound in line 9 round towards $+\infty$. Recall that all
probabilities in the MDP are rational numbers, i.e. representable as $\frac{num}{den}$ with
$num, den \in \mathbb{N}$. We assume that num and den can be represented exactly in the
implementation. Then, in line 8, we calculate the floating-point values for the
$\mu(s') = num/den$ by rounding towards $-\infty$. In line 9, we round the result of the
corresponding division towards $+\infty$. Finally, instead of returning the middle of
the interval in line 10, we return [l(sr), u(sr)] so as not to lose any information 
(e.g. in case the result is compared to a constant as in the example of Sect. 3.1). 

With these changes, we obtain an interval guaranteed to contain the true 
reachability probability if the algorithm terminates. However, rounding away 
from the theoretical fixpoint in the updates of l and u means that we may
reach an effective fixpoint (where l and u no longer change because all newly
computed values round down/up to the values from the previous iteration) at
a point where the relative difference of $l(s_I)$ and $u(s_I)$ is still above $\varepsilon$. This
will happen in practice: In QComp 2020 [6], mcsta participated in the floating- 
point correct track by letting VI run until it reached a fixpoint under the default 
rounding mode with double precision. In 9 of the 44 benchmark instances that 
mcsta attempted to solve in this way, the difference between this fixpoint and 


³ The documentation as of Oct. 2021 states that -frounding-math "does not currently
guarantee to disable all GCC optimizations that are affected by rounding mode."
1  function SR-SII($M = (S, s_I, T)$, G, opt, $\varepsilon$)
2    ... (preprocessing as in Alg. 1) ...
3    repeat
4      chg := false
5      fesetround(towards $-\infty$)
6      foreach $s \in S \setminus (S_0 \cup S_1)$ do
7        $l_{new} := \mathrm{opt}_{\mu \in T(s)} \sum_{s' \in \mathit{spt}(\mu)} \mu(s') \cdot l(s')$  // iterate lower vector
8        if $l_{new} \ne l(s)$ then chg := true
9        $l(s) := l_{new}$
10     fesetround(towards $+\infty$)
11     foreach $s \in S \setminus (S_0 \cup S_1)$ do
12       $u_{new} := \mathrm{opt}_{\mu \in T(s)} \sum_{s' \in \mathit{spt}(\mu)} \mu(s') \cdot u(s')$  // iterate upper vector
13       if $u_{new} \ne u(s)$ then chg := true
14       $u(s) := u_{new}$
15   until $\neg$chg $\vee\ (u(s_I) - l(s_I)) / l(s_I) \le \varepsilon$
16   return $[l(s_I), u(s_I)]$


Alg. 2: Safely rounding sequential interval iteration (SR-SII) for x87 or SSE 


the true value was more than the specified $\varepsilon$. With safe rounding away from the
true fixpoint, this would likely have happened in even more cases. 

To ensure termination, we thus need to make one further change to the II of 
Alg. 1: In each iteration of the while loop, we additionally keep track of whether 
any of the updates to l and u changes the previous value. If not, we end the loop 
and return the current interval, which will be wider than the requested $\varepsilon$ relative
difference. We refer to II with all of these modifications as safely rounding
interleaved II (SR-III) in the remainder of this paper.


4.1 Sequential Interval Iteration 


When using the x87 or SSE instruction sets to implement SR-III, we need to 
insert a call to fesetround just before line 8, and another just before line 9. 
If, for an MDP with n states, we need m iterations of the while loop, we will
make $2 \cdot n \cdot m$ calls to fesetround. This might significantly impact performance
for models with many states, or that need many iterations (such as the haddad-monmege
model of the QVBS, which requires 7 million iterations with $\varepsilon = 10^{-9}$
despite only having 41 states). As an alternative, we can rearrange the iteration
phase of II as shown in Alg. 2: We first update l for all states (lines 6-9), then u
for all states (lines 11-14), with the rounding mode changes in between (lines 5
and 10). We call this variant of II safely rounding sequential II (SR-SII). It only
needs $2 \cdot m$ calls to fesetround, which should improve its performance. However,
it also changes the memory access pattern of II with an a priori unknown effect 
on performance. We write III for II to stress that it is interleaved, and SII for 
Alg. 2 without the safe rounding, in the remainder of this paper. 
4.2 Implementation Aspects


We have implemented III, SII, SR-III, and SR-SII in mcsta. While mcsta is writ- 
ten in C#, the new algorithms are (necessarily) written in C, called from the 
main tool via the P/Invoke mechanism. We used GCC 10.3.0 to compile our 
implementations on both 64-bit Linux and Windows 10. We manually inspected 
the disassembly of the generated code to ensure that GCC's optimisations did 
not interfere with rounding mode changes as described in Sect. 3.4. In a sig- 
nificant architectural change, we modified mcsta's state space exploration and 
representation code to preserve the exact rational values for the probabilities 
specified in the model, so that safely-rounded floating-point representations for 
the p(s’) can be computed during iteration as described above. 
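
To make the idea of a safely rounded representation concrete, here is a small sketch that encloses an exact rational probability between two adjacent doubles. It is an illustration under assumptions, not mcsta's code: it relies on Python's Fraction-to-float conversion being correctly rounded and on math.nextafter (Python 3.9 or later), and the name enclose is hypothetical.

```python
import math
from fractions import Fraction


def enclose(p: Fraction) -> tuple[float, float]:
    """Return floats (lo, hi) with lo <= p <= hi and no float strictly between them."""
    nearest = float(p)          # correctly rounded to the nearest double
    exact = Fraction(nearest)   # the exact value of that double
    if exact == p:
        return nearest, nearest
    if exact < p:
        return nearest, math.nextafter(nearest, math.inf)
    return math.nextafter(nearest, -math.inf), nearest


lo, hi = enclose(Fraction(1, 3))
print(lo, hi)
```

The lower value would be used when updating l and the upper value when updating u, so that neither bound is pushed past the true result by the conversion alone.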

Of each algorithm, we implemented four variants: a default one that leaves the 
choice of instruction set to the compiler and uses fesetround to change rounding 
mode; an x87 variant that forces floating-point operations to use the x87 
instructions by attributing the relevant functions with target("fpmath=387") 
and that changes rounding mode via inline assembly using FNSTCW/FLDCW; 
an SSE variant that forces the SSE instruction set via target("fpmath=sse") 
and uses VSTMXCSR/VLDMXCSR in inline assembly for rounding mode changes; 
and an AVX-512 variant that implements all floating-point operations requiring 
non-default rounding modes via AVX-512 intrinsics, in particular using 
_mm_fmadd_round_sd in the Bellman equations. All variants use double precision; 
default and SSE additionally have a single-precision version (which we 
omit for x87 since the reduced precision does not speed up the operations we 
use); and x87 also provides an 80-bit extended-precision version (however we 
currently return its results as safely-rounded double-precision values due to the 
unavailability of a long double equivalent in C#, which limits its use outside of 
performance testing for now). All in all, we thus provide 28 variants of interval 
iteration for comparison, out of which 14 provide guaranteed correct results. 
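
The count of 28 can be reproduced by enumerating the combinations described above. The sketch below only restates the text; the dictionary layout and labels are illustrative.

```python
ALGORITHMS = ["III", "SII", "SR-III", "SR-SII"]

# Precisions offered by each instruction-set variant, as described in the text.
PRECISIONS = {
    "default": ["single", "double"],
    "x87": ["double", "extended"],
    "SSE": ["single", "double"],
    "AVX-512": ["double"],
}

variants = [
    (algorithm, instruction_set, precision)
    for algorithm in ALGORITHMS
    for instruction_set, precisions in PRECISIONS.items()
    for precision in precisions
]
safe = [v for v in variants if v[0].startswith("SR-")]

print(len(variants), len(safe))  # 28 14
```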

In particular, the safe rounding makes PMC feasible at 32-bit single precision, 
which would otherwise be too likely to produce incorrect results. While we expect 
that this may deliver many results with low precision (but which are correct) due 
to a rounded fixpoint being reached long before the relative width reaches ε, it 
also halves the memory needed to store l and u, and may speed up computations. 
At the opposite end, mcsta is now also the first PMC tool that can use 80-bit 
extended precision, which however doubles the memory needed for l and u since 
80-bit long double values occupy 16 bytes in memory (with GCC). 


5 Experiments 


Using our implementation in mcsta, we first tested all variants of the algorithms 
on M7? in the setting of Sect. 3.1. As expected, and validating the correctness of 
the approach and its implementation, all SR variants return unknown. 

We then assembled a set of 31 benchmark instances—combinations of a 
model, values for its configurable parameters, and a property to check—from 
the QVBS covering DTMC, MDP, and probabilistic timed automata (PTA) [24] 
transformed to MDP by mcsta using the digital clocks approach [23]. These are 
all the models and probabilistic reachability properties from the QVBS supported 
by mcsta for which the result was not 0 or 1 (then it can be computed via 
graph-based algorithms) and for which a parameter configuration was available 
where PMC terminated within our timeout of 120 s but II needed enough time for 
it to be measured reliably (≥ 0.2 s). We checked each of these benchmarks with 
all 28 variants of our algorithms using ε = $10^{-9}$ on different x86-64 systems: 
I11w: an Intel Core i5-1135G7 (up to 4.2 GHz) laptop running Windows 10, 
this being the only system we had access to with AVX-512 support; AMDw: 
an AMD Ryzen 9 5900X (3.7-4.8 GHz) workstation running Windows 10, repre- 
senting current AMD CPUs in our evaluation; I4x: an Intel Core i7-4790 (3.6- 
4.0 GHz) workstation running Ubuntu Linux 18.04, representing older-generation 
Intel desktop hardware; and IPx: an Intel Pentium Silver J5005 (1.5-2.8 GHz) 
compact PC running Ubuntu Linux 18.04, representing a non-Core low-power 
Intel system. We show a selection of our experimental results in the remainder 
of this section, mainly from I11w and AMDw. We remark on cases where the 
other systems (all with Intel CPUs) showed different patterns from I11w. 


We present results graphically as scatter plots like in Fig. 2. Each such plot 
compares two algorithm variants in terms of runtime for the iteration phase of the 
algorithm only (i.e. we exclude the time for state space exploration and preprocessing). 
Every point (x, y) corresponds to a benchmark instance and indicates 
that the variant noted on the x-axis took x seconds to solve this instance while 
the one noted on the y-axis took y seconds. Thus points above the solid diagonal 
line correspond to instances where the x-axis method was faster; points above 
(below) the upper (lower) dotted diagonal line are where the x-axis method took 
less than half (more than twice) as long. 
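
The reading of a scatter plot point can be written down as a small classifier. This is only an illustrative sketch of the description above; the function name and the returned labels are not from the paper.

```python
def classify(x: float, y: float) -> str:
    """Interpret a point (x, y): x and y are the runtimes of the two variants."""
    if y > 2 * x:
        # Above the upper dotted diagonal.
        return "x-axis variant took less than half as long"
    if y > x:
        # Above the solid diagonal.
        return "x-axis variant was faster"
    if y < x / 2:
        # Below the lower dotted diagonal.
        return "x-axis variant took more than twice as long"
    if y < x:
        return "y-axis variant was faster"
    return "both variants took the same time"


print(classify(1.0, 3.0))
print(classify(4.0, 1.0))
```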


Fig. 2 first shows the performance impact of enabling safe rounding for the 
standard interleaved algorithm using double precision. The top row shows the 
behaviour on I11w. We see that runtime is drastically longer in the default variant 
that uses fesetround, but only increases by a factor of around 2 if we use 
the specific inline assembly instructions. We note that GCC includes the code 
for fesetround in the generated .dll file on Windows, but in contrast to the 
assembly methods does not inline it into the callers. Some of the difference 
may thus be function call overhead. The middle row shows the behaviour on 
AMDw. Here, default is affected just as badly, but the effect on SSE is worse 
while that on x87 is much lower than on the Intel I11w system. In the bottom 
row, we show the impact on default on the Linux systems (bottom left and 
bottom middle), which is much lower than on Windows. This is despite GCC 
implementing fesetround as an external library call here. The overhead still 
markedly differs between the two Intel CPUs, though. Finally, as expected, we 
see on the bottom right that safe rounding has almost no performance impact 
when using the AVX-512 instructions. 


Fig. 2. Performance impact of safe rounding across instruction sets and systems 


Seeing the significant impact enabling safe rounding can have, we next show 
what the sequential algorithm brings to the table, in Fig. 3. On the top left, we 
compare the base algorithms without safe rounding, where SII takes up to twice 
as long in the worst case. This is likely due to the more cache-friendly memory 
access pattern of III: we store l and u interleaved for III, so it always operates 
on two adjacent values at a time. The bottom-left plot confirms that reducing 
the number of rounding mode changes reduces the overhead of safe rounding to 
essentially zero. The remaining four plots show the differences between SR-III 
and SR-SII. In all cases except x87 on AMDw, SR-III is slower. We thus have 
that III is fastest but unsafe, SII and SR-SII are equally fast but the latter is 
safe, and SR-III is safe but tends to be slower on the Intel systems. On the AMD 
system, SR-III surprisingly wins over SR-SII with x87, highlighting that the x87 
instruction set in Ryzen 3 must be implemented very differently from SSE. 

Fig. 3. Performance of interleaved compared to sequential II 

We further investigate the impact of the instruction set in Fig. 4. Confirming 
the patterns we saw so far, SSE is slightly faster than x87 on I11w (and we see 
similar behaviour on the other Intel systems) but slower by a factor of more 
than 2 on the AMD CPU. The rightmost plot highlights that AVX-512 is the 
fastest alternative on the most recent Intel CPUs, which may in part be due to 
the availability of the fused multiply-add instruction that fits II so well. 

Fig. 5. Performance with different precision settings (on I11w) 

All results so far were for double-precision computations. To conclude our 
evaluation, we show in Fig. 5 that reducing to single precision does not bring 
the expected performance benefits. We see in the leftmost plot that the overhead 
of safe rounding has a much higher variance compared to Fig. 2. The detailed tool 
outputs hint at the reason being that rounding away from the fixpoint occurs in 
much larger steps with single precision, which significantly slows down or stops 
the convergence in several instances. The middle plot shows that, aside from the 
slowly converging outliers, using single precision does not provide a speedup over 
using doubles. Finally, on the right, we show that the impact of enabling 80-bit 
extended precision on x87 is minimal. 


6 Conclusion 


There has been ample research into sound PMC algorithms over the past years, 
but the problem of errors introduced by naive implementations using default 
floating-point rounding has been all but ignored. We showed that a solution ex- 
ists that, while perhaps conceptually simple, faces a number of implementation 
and performance obstacles. In particular, hardware support for rounding modes 
is arguably essential to achieve acceptable performance, but difficult to use from 
C/C++ and impossible to access from most other programming languages. We 
extensively explored the space of implementation variants, highlighting that per- 
formance crucially depends on the combination of the variant, the CPU, and the 
operating system. Nevertheless, our results show that truly correct PMC is pos- 
sible today at a small cost in performance, which should all but disappear as 
AVX-512 is more widely adopted. With our implementation in mcsta, we provide 
the first PMC tool that is fast, scalable, and correct. 


Abstract. Game-theoretic techniques and equilibria analysis facilitate 
the design and verification of competitive systems. While algorithmic 
complexity of equilibria computation has been extensively studied, prac- 
tical implementation and application of game-theoretic methods is more 
recent. Tools such as PRISM-games support automated verification and 
synthesis of zero-sum and (ε-optimal subgame-perfect) social welfare 
Nash equilibria properties for concurrent stochastic games. However, 
these methods become inefficient as the number of agents grows and may 
also generate equilibria that yield significant variations in the outcomes 
for individual agents. We extend the functionality of PRISM-games to 
support correlated equilibria, in which players can coordinate through 
public signals, and introduce a novel optimality criterion of social fair- 
ness, which can be applied to both Nash and correlated equilibria. We 
show that correlated equilibria are easier to compute, are more equitable, 
and can also improve joint outcomes. We implement algorithms for both 
normal form games and the more complex case of multi-player concur- 
rent stochastic games with temporal logic specifications. On a range of 
case studies, we demonstrate the benefits of our methods. 


1 Introduction 


Game-theoretic verification techniques can support the modelling and design of 
systems that comprise multiple agents operating in either a cooperative or com- 
petitive manner. In many cases, to effectively analyse these systems we also need 
to adopt a probabilistic approach to modelling, for example because agents oper- 
ate in uncertain environments, use faulty hardware or unreliable communication 
mechanisms, or explicitly employ randomisation for coordination. 

In these cases, probabilistic model checking provides a convenient unified 
framework for both formally modelling probabilistic multi-agent systems and 
specifying their required behaviour. In recent years, progress has been made in 
this direction for several models, including turn-based and concurrent stochastic 
games (TSGs and CSGs), and for multiple temporal logics, such as rPATL [10] 
and its extensions [24]. Tool support has been developed, in the form of PRISM- 
games [22], and successfully applied to case studies across a broad range of areas. 

Initially, the focus was on zero-sum specifications [24], which can be natural 
for systems whose participants have directly opposing goals, such as the defender 
and attacker in a security protocol minimising or maximising the probability of 
a successful attack, respectively. However, agents often have objectives that are 
distinct but not directly opposing, and may also want to cooperate to achieve 
these objectives. Examples include network protocols and multi-robot systems. 

For these purposes, Nash equilibria (NE) have also been integrated into 
probabilistic model checking of CSGs [24], together with the social welfare (SW) 
optimality criterion, resulting in social welfare Nash equilibria (SWNE). An SWNE 
comprises a strategy for each player in the game where no player has an incen- 
tive to deviate unilaterally from their strategy and the sum of the individual 
objectives over all players is maximised. 

One key limitation of SWNE, however, is that, as these techniques are ex- 
tended to support larger numbers of players [21], the efficiency and scalability 
of synthesising SWNE is significantly reduced. In addition, simply aiming to 
maximise the sum of individual objectives may not produce the best perform- 
ing equilibrium, either collectively or individually; for example, they can offer 
higher gains for specific players, reducing the incentive of the other players to 
collaborate and instead motivating them to deviate from the equilibrium. 

In this paper, we adopt a different approach and introduce, for the first time 
within formal verification, both social fairness as an optimality criterion and 
correlated equilibria, and the insights required to make these usable in practical 
applications. Social fairness (SF) is particularly novel, as it is inspired by similar 
concepts used in economics and distinct from the fairness notions employed in 
verification. Correlated equilibria (CE) [3], in which players are able to coordi- 
nate through public signals, are easier to compute than NE and can yield better 
outcomes. Social fairness, which minimises the differences between the objectives 
of individual players, can be considered for both CE and NE. 

We first investigate these concepts for the simpler case of normal form games, 
illustrating their differences and benefits. We then extend the approach to the 
more powerful modelling formalism of CSGs and extend the temporal logic 
rPATL to formally specify agent objectives. We present algorithms to synthesise 
equilibria, using linear programming to find CE and a combination of back- 
wards induction or value iteration for CSGs. We implement our approach in 
the PRISM-games tool [22] and demonstrate significant gains in computation 
time and that quantifiably more fair and useful strategies can be synthesised 
for a range of application domains. An extended version of this paper, with the 
complete model checking algorithm, is available [23]. 


Related work. Nash equilibria have been considered for concurrent systems 
in [18], where a temporal logic is proposed whose key operator is a novel path 
quantifier which asserts that a property holds on all Nash equilibrium computa- 
tions of the system. There is no stochasticity and correlated equilibria are not 
considered. In [2], a probabilistic logic that can express equilibria is formulated, 
along with complexity results, but no implementation has been provided. 

The notion of fairness studied here is inspired by fairness of equilibria from 
economics [33,34] and aims to minimise the difference between the payoffs, as 
opposed to maximising the lowest payoff among the players in an NE [25]. Our 
notion of fairness can be thought of as a constraint applied to equilibria strate- 
gies, similar in style to social welfare, and used to select certain equilibria based 
on optimality. This is distinct from fairness used in verification of concurrent 
processes, where (strong) fairness refers to a property stating that, whenever a 
process is enabled infinitely often, it is executed infinitely often. This notion is 
typically defined as a constraint on infinite execution paths expressible in logics 
LTL and CTL* and needed to prove liveness properties. For probabilistic models, 
verification under fairness constraints has been formulated for Markov decision 
processes and the logic PCTL* [5,4]. For games on graphs, fairness conditions 
expressed as ω-regular winning conditions can be used to synthesise reactive 
processes [8]. Algorithms for strong transition fairness for ω-regular games have 
been recently studied in [6]. Both qualitative and quantitative approaches have 
been considered for verification under fairness constraints, but no equilibria. 


2 Normal Form Games 


We start by considering normal form games (NFGs), then define our equilibria 
concepts for these games, present algorithms and an implementation for com- 
puting them, and finally summarise some experimental results. 

We first require the following notation. Let Dist(X) denote the set of probability 
distributions over set X. For any vector $v \in \mathbb{R}^n$, we use $v(i)$ to refer 
to the $i$th entry of the vector. For any tuple $x = (x_1, \dots, x_n) \in X^n$, element 
$x' \in X$ and $i \le n$, we define the tuples $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ and 
$x_{-i}[x'] = (x_1, \dots, x_{i-1}, x', x_{i+1}, \dots, x_n)$. 

Definition 1 (Normal form game). A (finite, n-person) normal form game 
(NFG) is a tuple $\mathsf{N} = (N, A, u)$ where: $N = \{1, \dots, n\}$ is a finite set of players; 
$A = A_1 \times \cdots \times A_n$ and $A_i$ is a finite set of actions available to player $i \in N$; 
$u = (u_1, \dots, u_n)$ and $u_i : A \to \mathbb{R}$ is a utility function for player $i \in N$. 


We fix an NFG $\mathsf{N} = (N, A, u)$ for the remainder of this section. In a play of $\mathsf{N}$, 
each player $i \in N$ chooses an action from the set $A_i$ at the same time. If each 
player $i$ chooses $a_i$, then the utility received by player $j$ equals $u_j(a_1, \dots, a_n)$. 
We next define the strategies for players of N and strategy profiles comprising 
a strategy for each player. We also define correlated profiles, which allow the 
players to coordinate their choices through a (probabilistic) public signal. 


Definition 2 (Strategy and profile). A strategy $\sigma_i$ for player $i$ is an element 
of $\Sigma_i = \mathit{Dist}(A_i)$ and a strategy profile $\sigma$ is an element of $\Sigma_N = \Sigma_1 \times \cdots \times \Sigma_n$. 


For strategy $\sigma_i$ of player $i$, the support is the set of actions $\{a_i \in A_i \mid \sigma_i(a_i) > 0\}$ 
and the support of a profile is the product of the supports of the strategies. 

Definition 3 (Correlated profile). A correlated profile is a tuple $(\tau, \varsigma)$ comprising 
$\tau \in \mathit{Dist}(D)$, where $D = D_1 \times \cdots \times D_n$, $D_i$ is a finite set of signals for 
player $i$, and $\varsigma = (\varsigma_1, \dots, \varsigma_n)$, where $\varsigma_i : D_i \to A_i$. 


For a correlated profile $(\tau, \varsigma)$, the public signal $\tau$ is a joint distribution over 
signals $D_i$ for each player $i$ such that, if player $i$ receives the signal $d_i \in D_i$, then 
it chooses action $\varsigma_i(d_i)$. We can consider any correlated profile $(\tau, \varsigma)$ as a joint 
strategy, i.e., a distribution over $A_1 \times \cdots \times A_n$ where: 


$$(\tau, \varsigma)(a_1, \dots, a_n) = \sum \{ \tau(d_1, \dots, d_n) \mid d_i \in D_i \wedge \varsigma_i(d_i) = a_i \text{ for all } i \in N \}.$$ 


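Conversely, any joint strategy $\tau \in \mathit{Dist}(A_1 \times \cdots \times A_n)$ can be considered as a 
correlated profile $(\tau, \varsigma)$ where $D_i = A_i$ and $\varsigma_i$ is the identity function for $i \in N$. 

Any strategy profile $\sigma$ can be mapped to an equivalent correlated profile (in 
which $\tau$ is the joint distribution $\sigma_1 \times \cdots \times \sigma_n$ and $\varsigma_i$ is the identity function). On 
the other hand, there are correlated profiles with no equivalent strategy profile. 
Under profile $\sigma$ and correlated profile $(\tau, \varsigma)$ the expected utilities of player $i$ are: 


$$u_i(\sigma) = \sum_{(a_1, \dots, a_n) \in A} u_i(a_1, \dots, a_n) \cdot \Big( \prod_{j=1}^{n} \sigma_j(a_j) \Big)$$ 
$$u_i(\tau, \varsigma) = \sum_{(d_1, \dots, d_n) \in D} \tau(d_1, \dots, d_n) \cdot u_i(\varsigma_1(d_1), \dots, \varsigma_n(d_n)).$$ 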


Example 1. Consider the two-player NFG where $A_i = \{a_i^1, a_i^2\}$ and a correlated 
profile corresponding to the joint distribution $\tau \in \mathit{Dist}(A_1 \times A_2)$ where 
$\tau(a_1^1, a_2^1) = \tau(a_1^2, a_2^2) = 0.5$. Under this correlated profile the players share a fair 
coin and both choose their first action if the coin is heads and their second action 
otherwise. This has no equivalent strategy profile. 
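
The claim in Example 1 can be checked mechanically: if a joint distribution were induced by independent strategies, those strategies would have to be its marginals, so it would equal the product of its marginals. The sketch below performs that check for two players; the action labels and function names are illustrative.

```python
from itertools import product

# Joint distribution of Example 1: both play their first action, or both their second.
tau = {("a1_1", "a2_1"): 0.5, ("a1_2", "a2_2"): 0.5}


def marginal(joint: dict, player: int) -> dict:
    """Marginal distribution over the actions of one player."""
    out: dict = {}
    for actions, prob in joint.items():
        out[actions[player]] = out.get(actions[player], 0.0) + prob
    return out


def has_equivalent_strategy_profile(joint: dict, tol: float = 1e-12) -> bool:
    """True iff the two-player joint distribution is the product of its marginals."""
    m1, m2 = marginal(joint, 0), marginal(joint, 1)
    return all(
        abs(m1[a] * m2[b] - joint.get((a, b), 0.0)) <= tol
        for a, b in product(m1, m2)
    )


print(has_equivalent_strategy_profile(tau))  # False
```

Here both marginals are uniform, so independent play would put probability 0.25 on each of the four action pairs, whereas the shared coin puts 0.5 on two of them and 0 on the others.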


Optimal equilibria of NFGs. We now introduce the notions of Nash equilib- 
rium [27] and correlated equilibrium [3], as well as different definitions of opti- 
mality for these equilibria: social welfare and social fairness. Using the notation 
introduced above for tuples, for any profile $\sigma$ and strategy $\sigma_i^\star$, the strategy tuple 
$\sigma_{-i}$ corresponds to $\sigma$ with the strategy of player $i$ removed and $\sigma_{-i}[\sigma_i^\star]$ to the 
profile $\sigma$ after replacing player $i$'s strategy with $\sigma_i^\star$. 


Definition 4 (Best response). For a profile $\sigma$ and correlated profile $(\tau, \varsigma)$, a 
best response for player $i$ to $\sigma_{-i}$ and $(\tau, \varsigma_{-i})$ are, respectively: 


- a strategy $\sigma_i^\star$ for player $i$ such that $u_i(\sigma_{-i}[\sigma_i^\star]) \ge u_i(\sigma_{-i}[\sigma_i])$ for all $\sigma_i \in \Sigma_i$; 
- a function $\varsigma_i^\star : D_i \to A_i$ for player $i$ such that $u_i(\tau, \varsigma_{-i}[\varsigma_i^\star]) \ge u_i(\tau, \varsigma_{-i}[\varsigma_i])$ 
  for all functions $\varsigma_i : D_i \to A_i$. 


Definition 5 (NE and CE). A strategy profile $\sigma^\star$ is a Nash equilibrium (NE) 
and a correlated profile $(\tau, \varsigma^\star)$ is a correlated equilibrium (CE) if: 


- $\sigma_i^\star$ is a best response to $\sigma_{-i}^\star$ for all $i \in N$; 
- $\varsigma_i^\star$ is a best response to $(\tau, \varsigma_{-i}^\star)$ for all $i \in N$; 


respectively. We denote by $\Sigma^N$ and $\Sigma^C$ the set of NE and CE, respectively. 

a                      | u_1(a) | u_2(a) | u_3(a) 
(pro_1, pro_2, pro_3)  | -1000  | -1000  | -100 
(pro_1, pro_2, yld_3)  | -1000  | -100   | -5 
(pro_1, yld_2, pro_3)  | 5      | -5     | 5 
(pro_1, yld_2, yld_3)  | 5      | -5     | -5 
(yld_1, pro_2, pro_3)  | -5     | -1000  | -100 
(yld_1, pro_2, yld_3)  | -5     | 5      | -5 
(yld_1, yld_2, pro_3)  | -5     | -5     | 5 
(yld_1, yld_2, yld_3)  | -10    | -10    | -10 


Fig. 1: Example: Cars at an intersection and the corresponding NFG. 


Any NE of N is also a CE, while there can exist CEs that cannot be represented 
by a strategy profile and therefore are not NEs. For each class of equilibria, 
NE and CE, we introduce two optimality criteria, the first maximising social 
welfare (SW), defined as the sum of the utilities, and the second maximising 
social fairness (SF), which minimises the difference between the players' utilities. 
Other variants of fairness have been considered for NE, such as in [25], where 
the authors seek to maximise the lowest utility among the players. 


Definition 6 (SW and SF). An equilibrium $\sigma^\star$ is a social welfare (SW) equilibrium 
if the sum of the utilities of the players under $\sigma^\star$ is maximal over all 
equilibria, while $\sigma^\star$ is a social fair (SF) equilibrium if the difference between the 
players' utilities under $\sigma^\star$ is minimised over all equilibria. 
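
The two criteria can be computed directly from a vector of expected utilities. The sketch below assumes that "the difference between the players' utilities" means the gap between the largest and smallest utility, in line with the objective used later for SFCE; the utility vectors are the ones reported for the three-car example in this section, and the function names are illustrative.

```python
def social_welfare(utilities) -> float:
    """Sum of the players' expected utilities (to be maximised for SW)."""
    return sum(utilities)


def fairness_gap(utilities) -> float:
    """Largest minus smallest expected utility (to be minimised for SF)."""
    return max(utilities) - min(utilities)


reported = {
    "SWNE/SWCE": (5, -5, 5),
    "SFNE": (-9.254050, -9.925742, -9.318182),
    "SFCE": (0, 0, 0),
}
for name, u in reported.items():
    print(name, social_welfare(u), fairness_gap(u))
```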


We can also define the dual concept of cost equilibria [24], where players try to 
minimise, rather than maximise, their expected utilities by considering equilibria 
of the game $\mathsf{N}^- = (N, A, -u)$ in which the utilities of $\mathsf{N}$ are negated. 


Example 2. Consider the scenario, based on an example from [32], where three 
cars meet at an intersection and want to proceed as indicated by the arrows 
in Figure 1. Each car can either proceed or yield. If two cars with intersecting 
paths proceed, then there is an accident. If an accident occurs, the car having 
the right of way, i.e., the other car is to its right, has a utility of —100 and the 
car that should yield has a utility of —1000. If a car proceeds without causing an 
accident, then its utility is 5 and the cars that yield have a utility of —5. If all 
cars yield, then, since this delays all cars, all have utility —10. The 3-player NFG 
is given in Figure 1. Considering the different optimal equilibria of the NFG: 


- the SWNE and SWCE are the same: for $c_2$ to yield and $c_1$ and $c_3$ to proceed, 
  with the expected utilities (5, -5, 5); 

- the SFNE is for $c_1$ to yield with probability 1, $c_2$ to yield with probability 
  0.863636 and $c_3$ to yield with probability 0.985148, with the expected utilities 
  (-9.254050, -9.925742, -9.318182); 

- the SFCE gives a joint distribution where the probability of $c_2$ yielding and 
  of $c_1$ and $c_3$ yielding are both 0.5 with the expected utilities (0, 0, 0). 


Modifying $u_2$ such that $u_2(pro_1, pro_2, pro_3) = -4.5$ to, e.g., represent a reckless 
driver, the SWNE becomes for $c_1$ and $c_3$ to yield and $c_2$ to proceed with the 
expected utilities (-5, 5, -5), while the SWCE is still for $c_2$ to yield and $c_1$ and 
$c_3$ to proceed. The SFNE and SFCE also do not change. 
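
The equilibria of the unmodified game in Example 2 can be checked against the definition of a CE with a few lines of code. The sketch below encodes the utilities of Fig. 1 and tests, for every player and every recommended action, that switching to the other action does not increase expected utility. It is a verification aid under these assumptions, not the synthesis algorithm used in PRISM-games, and all names are illustrative.

```python
PRO, YLD = "pro", "yld"

# Utilities (u_1, u_2, u_3) of the three-car NFG from Fig. 1.
U = {
    (PRO, PRO, PRO): (-1000, -1000, -100),
    (PRO, PRO, YLD): (-1000, -100, -5),
    (PRO, YLD, PRO): (5, -5, 5),
    (PRO, YLD, YLD): (5, -5, -5),
    (YLD, PRO, PRO): (-5, -1000, -100),
    (YLD, PRO, YLD): (-5, 5, -5),
    (YLD, YLD, PRO): (-5, -5, 5),
    (YLD, YLD, YLD): (-10, -10, -10),
}


def expected_utilities(p: dict) -> tuple:
    """Expected utility of each player under joint distribution p."""
    return tuple(sum(prob * U[a][i] for a, prob in p.items()) for i in range(3))


def is_correlated_equilibrium(p: dict, tol: float = 1e-9) -> bool:
    """Check that no player gains by deviating from any recommended action."""
    for i in range(3):
        for recommended in (PRO, YLD):
            for deviation in (PRO, YLD):
                if deviation == recommended:
                    continue
                gain = 0.0
                for a, prob in p.items():
                    if a[i] != recommended:
                        continue
                    b = a[:i] + (deviation,) + a[i + 1:]
                    gain += prob * (U[a][i] - U[b][i])
                if gain < -tol:
                    return False
    return True


sfce = {(PRO, YLD, PRO): 0.5, (YLD, PRO, YLD): 0.5}
swce = {(PRO, YLD, PRO): 1.0}
print(expected_utilities(sfce), is_correlated_equilibrium(sfce))
print(expected_utilities(swce), is_correlated_equilibrium(swce))
```

By contrast, the distribution that makes all three cars yield fails the check, since any single car would gain by proceeding.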

Algorithms for computing equilibria. Before we give our algorithm to com- 
pute correlated equilibria, we briefly describe the approach of [21,24] for Nash 
equilibria computation that this paper builds upon. Finding NE in two-player 
NFGs is in the class of linear complementarity problems (LCPs) and we follow 
the algorithm presented in [24], which reduces the problem to SMT via labelled 
polytopes [28] by considering the regions of the strategy profile space, itera- 
tively reducing the search space as positive probability assignments are found 
and added as restrictions on this space. To find SWNE and SFNE, we can enu- 
merate all NE and then find the optimal NE. 

When there are more than two players, computing NE values becomes a more 
complex task, as finding NE within a given support no longer reduces to a linear 
programming (LP) problem. In [21] we presented an algorithm using support 
enumeration [31], which exhaustively examines all sub-regions, i.e., supports, 
of the strategy profile space, one at a time, checking whether that sub-region 
contains NEs. For each support, finding SWNE can be reduced to a nonlinear 
programming problem [21]. This nonlinear programming problem can be modified 
to find SFNE in each support, similarly to how the LP problem for SWCEs is 
modified to find SFCEs below. 

In the case of CE we can first find a joint strategy for the players, i.e., 
a distribution over the action tuples, which, as explained above, can then be 
mapped to a correlated profile. A SWCE can be found by solving the following 
LP problem. Maximise $\sum_{i \in N} \sum_{a \in A} u_i(a) \cdot p_a$ subject to: 


$$\sum_{a_{-i} \in A_{-i}} \big( u_i(a_{-i}[a_i]) - u_i(a_{-i}[a_i']) \big) \cdot p_{a_{-i}[a_i]} \ge 0 \qquad (1)$$ 
$$0 \le p_a \le 1 \qquad (2)$$ 
$$\sum_{a \in A} p_a = 1 \qquad (3)$$ 


for all $i \in N$, $a \in A$, $a_i, a_i' \in A_i$, $a_{-i} \in A_{-i}$, where $A_{-i} = \{a_{-i} \mid a \in A\}$. 
The variables $p_a$ represent the probability of the joint strategy corresponding 
to the correlated profile selecting the action-tuple $a$. The above LP has $|A|$ 
variables, one for each action-tuple, and $\sum_{i \in N} (|A_i|^2 - |A_i|) + |A| + 1$ constraints. 
Computation of SFCE can be reduced to the following optimisation problem. 
Minimise $p^{\max} - p^{\min}$ subject to: (1), (2) and (3) together with: 


$$p^i = \sum_{a \in A} u_i(a) \cdot p_a \qquad (4)$$ 
$$\Big( \bigwedge_{m \in N} p^i \ge p^m \Big) \to (p^{\max} = p^i) \qquad (5)$$ 
$$\Big( \bigwedge_{m \in N} p^i \le p^m \Big) \to (p^{\min} = p^i) \qquad (6)$$ 


for all $i \in N$, $m \ne i$, $a \in A$, $a_i, a_i' \in A_i$, $a_{-i} \in A_{-i}$. Again, the variables $p_a$ in 
the program represent the probability of the players playing the joint action $a$. 
The constraint (4) requires $p^i$ to equal the utility of player $i$. The constraints 
(5) and (6) set $p^{\max}$ and $p^{\min}$ as the maximum and minimum values within the 
utilities of the players, respectively. Given we use the constraints (1), (2) and 
(3), we start with the same number of variables and constraints as needed to 
compute SWCEs and incur an additional $|N| + 2$ variables and $3 \cdot |N|$ constraints. 
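
The problem sizes stated above are easy to evaluate for a concrete game. The following sketch computes the number of variables and constraints of the SWCE and SFCE programs from the players' action counts; the function names are illustrative, and the formulas are the ones given in the text.

```python
from math import prod


def swce_lp_size(action_counts) -> tuple:
    """(variables, constraints) of the SWCE linear program."""
    joint_actions = prod(action_counts)  # |A|
    variables = joint_actions
    constraints = sum(k * k - k for k in action_counts) + joint_actions + 1
    return variables, constraints


def sfce_problem_size(action_counts) -> tuple:
    """(variables, constraints) of the SFCE optimisation problem."""
    variables, constraints = swce_lp_size(action_counts)
    players = len(action_counts)  # |N|
    return variables + players + 2, constraints + 3 * players


# The three-car example: three players with two actions each.
print(swce_lp_size([2, 2, 2]))       # (8, 15)
print(sfce_problem_size([2, 2, 2]))  # (13, 24)
```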

Game             | Players | |A_i| | |A|   | Supports  | SWNE  | SWCE | SFCE 
Majority voting  | 2       | 4     | 16    | 225       | 0.07  | 0.02 | 0.08 
games            | 2       | 6     | 36    | 3,969     | 0.1   | 0.02 | 0.1 
                 | 2       | 8     | 64    | 65,025    | 0.4   | 0.03 | 0.3 
                 | 2       | 10    | 100   | 1,046,529 | 5.8   | 0.07 | 0.7 
                 | 3       | 3     | 27    | 343       | 1.2   | 0.07 | 0.1 
                 | 3       | 4     | 81    | 3,375     | 25.8  | 0.08 | 0.3 
Covariant        | 3       | 3     | 27    | 343       | 8.7   | 0.08 | 1.7 
games            | 3       | 4     | 81    | 3,375     | 598.5 | 0.08 | 2.9 
                 | 8       | 2     | 256   | 6,561     | TO    | 0.3  | TO 
                 | 8       | 3     | 6,561 | 5,764,801 | TO    | 22.8 | TO 
                 | 10      | 2     | 1,024 | 59,049    | TO    | 12   | TO 

Table 1: Times (s) for synthesis of equilibria in NFGs (timeout 30 mins). 


Implementation. To find SWNE or SFNE of two-player NFGs, we adopt a 
similar approach to [24], using labelled polytopes to characterise and find NE 
values through a reduction to SMT in both Z3 [13] and Yices [14]. As an op- 
timised precomputation step, when possible we also search for and filter out 
dominated strategies, which speeds up the computation and reduces solver calls. 

For NFGs with more than two players, solving the nonlinear programming 
problem based on support enumeration has been implemented in [21] using a 
combination of the SMT solver Z3 [13] and the nonlinear optimisation suite 
IPOPT [38]. To mitigate the inefficiencies of an SMT solver for such problems, 
we used Z3 to filter out unsatisfiable support assignments with a timeout and 
then IPOPT is called to find SWNE values using an interior-point filter line-search 
algorithm [39]. To speed up the overall computation, the support assignments are 
analysed in parallel. Computing SFNE increases the complexity of the nonlinear 
program and, due to the inefficiency in this approach [21], we have not extended 
the implementation to compute SFNE. 

As shown above, computing SWCE for NFGs reduces to solving an LP, and 
we implement this using either the optimisation solver Gurobi [17] or the SMT 
solver Z3 [13]. In the case of SFCE, the constraints (5) and (6) include impli- 
cations, and therefore the problem does not reduce directly to an LP. When 
using Z3, we can encode these constraints directly as it supports assertions that 
combine inequalities with logical implications, a feature that linear solvers such 
as Gurobi do not have. Section 5 discusses implementing SFCE computation in 
Gurobi. Both solvers support the specification of lower priority or soft objectives, 
which makes it possible to have a consistent ordering for the players’ payoffs in 
cases where multiple equilibria exist. 


Efficiency and scalability. Table 1 presents experimental results for solving 
a selection of NFGs randomly generated with GAMUT [29], using Gurobi for 
SWCE and NE of two-player NFGs, Z3 for SFCE and both IPOPT and Z3 for 
NFGs of more than two players, and running on a 2.10GHz Intel Xeon Gold with 
32GB of JVM memory. For each instance, Table 1 lists the number of players, 
actions for each player, joint actions and supports that need to be enumerated 
when finding NE, as well as the time to find SWNEs, SWCEs and SFCEs (the 
time for finding SFNEs of two-player games is the same as for SWNEs). As the 
results demonstrate, due to a simpler problem being solved and the fact that we 
do not need to enumerate the solutions, computing CEs scales far better than 
NEs as the number of players and actions increases. Finding NEs in games with 
more than two players is particularly hard as the constraints are nonlinear. We 
also see that SFCE computation is slower than SWCE, which is caused by the 
additional variables and constraints required when finding SFCE and using Z3 
rather than Gurobi for the solver. 
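
One way to see why support enumeration scales poorly: a support assigns each player a non-empty subset of its actions, so the number of supports is the product of $2^{|A_i|} - 1$ over all players. The sketch below evaluates this product; the values it yields agree with the supports column of Table 1, although relating the formula to that column is our reading rather than a statement from the paper.

```python
from math import prod


def support_count(action_counts) -> int:
    """Number of support profiles: one non-empty action subset per player."""
    return prod(2 ** k - 1 for k in action_counts)


print(support_count([4, 4]))    # 225
print(support_count([10, 10]))  # 1046529
print(support_count([3] * 8))   # 5764801
print(support_count([2] * 10))  # 59049
```

The number of joint actions, which determines the size of the LP for correlated equilibria, grows much more slowly: for ten players with two actions each there are 1,024 joint actions but 59,049 supports.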


3 Concurrent Stochastic Games 


We now further develop our approach to support concurrent stochastic games 
(CSGs) [36], in which players repeatedly make simultaneous action choices that 
cause the game’s state to be updated probabilistically. We extend the previously 
introduced definitions of optimal equilibria to such games, focusing on subgame- 
perfect equilibria, which are equilibria in every state of a CSG. We then present 
algorithms to reason about and synthesise such equilibria. 


Definition 7 (Concurrent stochastic game). A concurrent stochastic multi-player 
game (CSG) is a tuple $\mathsf{G} = (N, S, \bar{S}, A, \Delta, \delta, AP, L)$ where: 


- $N = \{1, \dots, n\}$ is a finite set of players; 
- $S$ is a finite set of states and $\bar{S} \subseteq S$ is a set of initial states; 
- $A = (A_1 \cup \{\bot\}) \times \cdots \times (A_n \cup \{\bot\})$ and $A_i$ is a finite set of actions available 
  to player $i \in N$ and $\bot$ is an idle action disjoint from the set $\bigcup_{i=1}^{n} A_i$; 
- $\Delta : S \to 2^{\bigcup_{i=1}^{n} A_i}$ is an action assignment function; 
- $\delta : (S \times A) \to \mathit{Dist}(S)$ is a (partial) probabilistic transition function; 
- $AP$ is a set of atomic propositions and $L : S \to 2^{AP}$ is a labelling function. 


For the remainder of this section we fix a CSG $G$ as in Definition 7. The game
$G$ starts in one of its initial states $\bar{s} \in \bar{S}$ and, supposing $G$ is in a state $s$,
each player $i$ of $G$ chooses an action from the set of available actions, defined
as $A_i(s) = \Delta(s) \cap A_i$ if $\Delta(s) \cap A_i$ is non-empty and $A_i(s) = \{\bot\}$ otherwise.
Supposing each player chooses $a_i$, the game transitions to state $s'$ with
probability $\delta(s,(a_1,\dots,a_n))(s')$. To enable quantitative analysis of $G$ we augment it
with reward structures, which are tuples $r = (r_A, r_S)$ of an action reward function
$r_A : S \times A \to \mathbb{R}$ and state reward function $r_S : S \to \mathbb{R}$.
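The rule for available actions can be illustrated with a minimal Python sketch. The class name `CSG`, its field names, the string `"idle"` standing in for the idle action, and the two-state example are all assumptions made for illustration; they are not taken from the paper or from any tool.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple

IDLE = "idle"  # hypothetical name for the idle action of Definition 7


@dataclass
class CSG:
    actions: List[FrozenSet[str]]          # actions[i] is the action set of player i
    assignment: Dict[str, FrozenSet[str]]  # action assignment: state -> enabled actions
    delta: Dict[Tuple[str, Tuple[str, ...]], Dict[str, float]]  # partial transition function

    def available(self, i: int, s: str) -> FrozenSet[str]:
        """Enabled actions of player i in s, or {idle} if none are enabled."""
        enabled = self.assignment[s] & self.actions[i]
        return enabled if enabled else frozenset({IDLE})

    def step_prob(self, s: str, joint: Tuple[str, ...], t: str) -> float:
        """Probability of moving from s to t under the joint action (0 if undefined)."""
        return self.delta.get((s, joint), {}).get(t, 0.0)


# Illustrative two-player game: in s1 only player 0 has an enabled action.
game = CSG(
    actions=[frozenset({"a", "b"}), frozenset({"c"})],
    assignment={"s0": frozenset({"a", "c"}), "s1": frozenset({"b"})},
    delta={
        ("s0", ("a", "c")): {"s0": 0.5, "s1": 0.5},
        ("s1", ("b", IDLE)): {"s1": 1.0},
    },
)
```

In this example player 1 has no enabled action in `s1`, so it can only idle there.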

A path of $G$ is a sequence $\pi = s_0 \xrightarrow{\alpha_0} s_1 \xrightarrow{\alpha_1} \cdots$ where $s_k \in S$,
$\alpha_k = (a_1^k,\dots,a_n^k) \in A$, $a_i^k \in A_i(s_k)$ for $i \in N$ and $\delta(s_k,\alpha_k)(s_{k+1}) > 0$ for all
$k \geq 0$. We denote by $FPaths_{G,s}$ and $IPaths_{G,s}$ the sets of finite and infinite paths
starting in state $s$ of $G$ respectively and drop the subscript $s$ when considering
all finite and infinite paths of $G$. As for NFGs, we can define strategies of $G$
that resolve the choices of the players. Here, a strategy for player $i$ is a function
$\sigma_i : FPaths_G \to Dist(A_i \cup \{\bot\})$ such that, if $\sigma_i(\pi)(a_i) > 0$, then $a_i \in A_i(last(\pi))$,
where $last(\pi)$ is the final state of $\pi$. Furthermore, we can define strategy profiles,
correlated profiles and joint strategies analogously to Definitions 2 and 3.

The utility of a player $i$ of $G$ is defined by a random variable $X_i : IPaths_G \to \mathbb{R}$
over infinite paths. For a profile† $\sigma$ and state $s$, using standard techniques [20],
we can construct a probability measure $Prob^{\sigma}_{G,s}$ over the paths with initial state $s$
corresponding to $\sigma$, denoted $IPaths^{\sigma}_{G,s}$, and the expected value $\mathbb{E}^{\sigma}_{G,s}(X_i)$ of player
$i$'s utility from $s$ under $\sigma$. Given utilities $X_1,\dots,X_n$ for all the players of $G$, we
can then define NE and CE (see Definition 5) as well as the restricted classes of
SW and SF equilibria as for NFGs (see Definition 6). Following [24,21], we focus
on subgame-perfect equilibria [30], which are equilibria in every state of $G$.


Nonzero-sum properties. As in [24] (for two-player CSGs) and [21] (for n- 
player CSGs) we can specify equilibria-based properties using temporal logic. 
For simplicity, we restrict attention to nonzero-sum properties without nesting, 
allowing for the specification of NE and CE against either SW or SF optimality. 


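Definition 8 (Nonzero-sum specifications). The syntax of nonzero-sum
specifications $\theta$ for CSGs is given by the grammar:

$\phi := \langle\!\langle C \rangle\!\rangle(\star_1,\star_2)_{\mathrm{opt}\sim x}(\theta)$
$\theta := \mathsf{P}[\psi]+\cdots+\mathsf{P}[\psi] \mid \mathsf{R}^{r}[\rho]+\cdots+\mathsf{R}^{r}[\rho]$
$\psi := \mathsf{X}\,a \mid a\ \mathsf{U}^{\leq k}\ a \mid a\ \mathsf{U}\ a$
$\rho := \mathsf{I}^{=k} \mid \mathsf{C}^{\leq k} \mid \mathsf{F}\,a$

where $C = C_1{:}\cdots{:}C_m$, $C_1,\dots,C_m$ are coalitions of players such that $C_i \cap C_j = \emptyset$
for all $1 \leq i \neq j \leq m$ and $\bigcup_{i=1}^{m} C_i = N$, $(\star_1,\star_2) \in \{\mathrm{NE},\mathrm{CE}\} \times \{\mathrm{SW},\mathrm{SF}\}$,
$\mathrm{opt} \in \{\min,\max\}$, $\sim\, \in \{<,\leq,\geq,>\}$, $x \in \mathbb{Q}$, $r$ is a reward structure, $k \in \mathbb{N}$ and $a$ is
an atomic proposition.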


The nonzero-sum formulae of Definition 8 extend the logic of [24,21] in that
we can now specify the type of equilibria, NE or CE, and optimality criteria, SW
or SF. A probabilistic formula $\langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\max\sim x}(\mathsf{P}[\psi_1]+\cdots+\mathsf{P}[\psi_m])$ is
true in a state if, when the players form the coalitions $C_1,\dots,C_m$, there is a
subgame-perfect equilibrium of type $\star_1$ meeting the optimality criterion $\star_2$ for
which the sum of the values of the objectives $\mathsf{P}[\psi_1],\dots,\mathsf{P}[\psi_m]$ for the coalitions
$C_1,\dots,C_m$ satisfies $\sim x$. The objective $\psi_i$ of coalition $C_i$ is either a next ($\mathsf{X}\,a$),
bounded until ($a_1\ \mathsf{U}^{\leq k}\ a_2$) or until ($a_1\ \mathsf{U}\ a_2$) formula, with the usual equivalences,
e.g., $\mathsf{F}\,a \equiv \mathtt{true}\ \mathsf{U}\ a$.

For a reward formula $\langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\mathrm{opt}\sim x}(\mathsf{R}^{r_1}[\rho_1]+\cdots+\mathsf{R}^{r_m}[\rho_m])$ the
meaning is similar; however, here the objective of coalition $C_i$ refers to a
reward formula $\rho_i$ with respect to reward structure $r_i$ and this formula is either
a bounded instantaneous reward ($\mathsf{I}^{=k}$), bounded accumulated reward ($\mathsf{C}^{\leq k}$) or
reachability reward ($\mathsf{F}\,a$).

For formulae of the form $\langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\min\sim x}(\theta)$, the dual notions of
cost equilibria are considered. We also allow numerical queries of the form
$\langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\mathrm{opt}=?}(\theta)$, which return the sum of the optimal
subgame-perfect equilibrium's values.


† We can also construct such a probability measure and expected value given a
correlated profile or joint strategy.

Model checking nonzero-sum specifications. Similarly to [24,21], to allow
model checking of nonzero-sum properties we consider a restricted class of CSGs. 
We make the following assumption, which can be checked using graph algorithms 
with time complexity quadratic in the size of the state space [1]. 


Assumption 1. For each subformula $\mathsf{P}[a_1\ \mathsf{U}\ a_2]$, a state labelled $\neg a_1 \lor a_2$ is
reached with probability 1 from all states under all strategy profiles and correlated
profiles. For each subformula $\mathsf{R}^{r}[\mathsf{F}\,a]$, a state labelled $a$ is reached with probability
1 from all states under all strategy profiles and correlated profiles.


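We now show how to compute the optimal values of a nonzero-sum formula
$\phi = \langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\mathrm{opt}\sim x}(\theta)$ when $\mathrm{opt} = \max$. The case when $\mathrm{opt} = \min$
can be computed by negating all utilities and maximising.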

The model checking algorithm broadly follows those presented in [24,21], with
the differences described below. The problem is reduced to solving an m-player
coalition game $G^C$ where $C = \{C_1,\dots,C_m\}$ and the choices of each player $i$ in $G^C$
correspond to the choices of the players in coalition $C_i$ in $G$. Formally, we have
the following definition in which, without loss of generality, we assume $C$ is of
the form $\{\{1,\dots,n_1\},\{n_1{+}1,\dots,n_2\},\dots,\{n_{m-1}{+}1,\dots,n_m\}\}$ and let $j_C$ denote
player $j$'s position in its coalition.


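Definition 9 (Coalition game). For CSG $G = (N, S, \bar{S}, A, \Delta, \delta, AP, L)$ and
partition $C = \{C_1,\dots,C_m\}$ of the players into m coalitions, we define the coalition
game $G^C = (\{1,\dots,m\}, S, \bar{S}, A^C, \Delta^C, \delta^C, AP, L)$ as an m-player CSG where:

- $A^C = (A_1^C \cup \{\bot\}) \times \cdots \times (A_m^C \cup \{\bot\})$;
- $A_i^C = \big(\prod_{j \in C_i} (A_j \cup \{\bot\})\big) \setminus \{(\bot,\dots,\bot)\}$ for all $1 \leq i \leq m$;
- for any $s \in S$ and $1 \leq i \leq m$: $a_i^C \in \Delta^C(s)$ if and only if either $\Delta(s) \cap A_j = \emptyset$
  and $a_i^C(j_C) = \bot$ or $a_i^C(j_C) \in \Delta(s)$ for all $j \in C_i$;
- for any $s \in S$ and $(a_1^C,\dots,a_m^C) \in A^C$: $\delta^C(s,(a_1^C,\dots,a_m^C)) = \delta(s,(a_1,\dots,a_n))$
  where for $i \in M$ and $j \in C_i$, if $a_i^C = \bot$ then $a_j = \bot$ and otherwise $a_j = a_i^C(j_C)$.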
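The coalition action sets of Definition 9, and the translation of a coalition-level joint action back to per-player actions, can be sketched as follows. This is an illustrative Python sketch only; the function names, the `"idle"` placeholder and the three-player example are assumptions, and the availability condition on states is not modelled.

```python
from itertools import product

IDLE = "idle"  # hypothetical name for the idle action


def coalition_actions(player_actions, coalition):
    """All tuples over (A_j plus idle) for j in the coalition, minus the all-idle tuple."""
    choices = [sorted(player_actions[j] | {IDLE}) for j in coalition]
    all_idle = tuple(IDLE for _ in coalition)
    return [t for t in product(*choices) if t != all_idle]


def expand(joint_coalition_action, coalitions):
    """Map (a_1^C, ..., a_m^C) to (a_1, ..., a_n).

    An idle coalition makes every member idle; otherwise member j takes
    the entry at its position within the coalition.
    """
    per_player = {}
    for a_c, coalition in zip(joint_coalition_action, coalitions):
        for pos, j in enumerate(coalition):
            per_player[j] = IDLE if a_c == IDLE else a_c[pos]
    return tuple(per_player[j] for j in sorted(per_player))


# Illustrative data: players 0 and 1 form one coalition, player 2 another.
player_actions = {0: {"a"}, 1: {"b", "c"}, 2: {"d"}}
coalitions = [[0, 1], [2]]
```

For instance, `expand((("a", "c"), IDLE), coalitions)` yields the per-player joint action in which players 0 and 1 play `a` and `c` while player 2 idles.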


If all the objectives in $\theta$ are finite-horizon, backward induction [35,27] can be
applied to compute (precise) optimal equilibria values with respect to the criterion
$\star_2$ and equilibria type $\star_1$. On the other hand, if all the objectives are
infinite-horizon, value iteration [9] can be used to approximate optimal equilibria values
and, when there is a combination of objectives, the game under study is modified
in a standard manner to make all objectives infinite-horizon.

Backward induction and value iteration over the CSG $G^C$ both work by
iteratively computing new values for each state $s$ of $G^C$. The values for each state,
in each iteration, are found by computing optimal equilibria values of an NFG N 
whose utility function is derived from the outgoing transition probabilities from 
s in the CSG and the values computed for successor states of s in the previous 
iteration. The difference here, with respect to [21], is that the NFGs are solved 
for the additional equilibria and optimality conditions considered in this paper, 
which we compute using the algorithms presented in Section 2. 


Algorithm for probabilistic until. Because of space limitations, we only
present here the details of value iteration for (unbounded) probabilistic until, i.e.,
for $\phi = \langle\!\langle C_1{:}\cdots{:}C_m \rangle\!\rangle(\star_1,\star_2)_{\max\sim x}(\theta)$ where $\theta = \mathsf{P}[a_1^1\ \mathsf{U}\ a_2^1]+\cdots+\mathsf{P}[a_1^m\ \mathsf{U}\ a_2^m]$.
The complete model checking algorithm can be found in [23].

Following [21], we use $V_{G^C}(s,\star_1,\star_2,\theta,n)$ to denote the vector of computed
values, at iteration $n$, in state $s$ of $G^C$ for optimality criterion $\star_2$ (SW or SF),
equilibria type $\star_1$ (NE or CE) and (until) objectives $\theta$. We also use $\mathbf{1}_m$ and $\mathbf{0}_m$
to denote a vector of size m whose entries are all equal to 1 or 0, respectively. For
any set of states $S'$, atomic proposition $a$ and state $s$ we let $\eta_{S'}(s)$ equal 1 if
$s \in S'$ and 0 otherwise, and $\eta_a(s)$ equal 1 if $a \in L(s)$ and 0 otherwise.

Each step of value iteration also keeps track of two sets $D, E \subseteq M$, where
$M = \{1,\dots,m\}$ are the players of $G^C$. We use $D$ for the subset of players that
have already reached their goal (by satisfying $a_2^i$) and $E$ for the players who
can no longer satisfy their goal (having reached a state that fails to satisfy
$a_1^i$). It can then be ensured that their payoffs no longer change and are set to 1
or 0, respectively. In these cases, we effectively consider a modified game where,
although the payoffs for these players are set, we still need to take their strategies
into account in order to guarantee an optimal equilibrium.

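Optimal values for all states $s$ in the CSG $G^C$ can be computed as the following
limit: $V_{G^C}(s,\star_1,\star_2,\theta) = \lim_{n\to\infty} V_{G^C}(s,\star_1,\star_2,\theta,n)$, where
$V_{G^C}(s,\star_1,\star_2,\theta,n) = V_{G^C}(s,\star_1,\star_2,\emptyset,\emptyset,\theta,n)$ and, for any $D, E \subseteq M$ such that $D \cap E = \emptyset$:

$$
V_{G^C}(s,\star_1,\star_2,D,E,\theta,n) =
\begin{cases}
(\eta_D(1),\dots,\eta_D(m)) & \text{if } D \cup E = M \\
(\eta_{a_2^1}(s),\dots,\eta_{a_2^m}(s)) & \text{else if } n = 0 \\
V_{G^C}(s,\star_1,\star_2,D \cup D',E,\theta,n) & \text{else if } D' \neq \emptyset \\
V_{G^C}(s,\star_1,\star_2,D,E \cup E',\theta,n) & \text{else if } E' \neq \emptyset \\
val(\mathsf{N},\star_1,\star_2) & \text{otherwise}
\end{cases}
$$

where $D' = \{l \in M \setminus (D \cup E) \mid a_2^l \in L(s)\}$, $E' = \{l \in M \setminus (D \cup E) \mid a_1^l \notin L(s)$ and
$a_2^l \notin L(s)\}$ and $val(\mathsf{N},\star_1,\star_2)$ equals optimal values of the NFG $\mathsf{N} = (M, A^C, u)$
with respect to the criterion $\star_2$ and of equilibria type $\star_1$, in which for
any $1 \leq l \leq m$ and $\alpha \in A^C$:

$$
u_l(\alpha) =
\begin{cases}
1 & \text{if } l \in D \\
0 & \text{else if } l \in E \\
\sum_{s' \in S} \delta^C(s,\alpha)(s') \cdot v_{n-1}^{s',l} & \text{otherwise}
\end{cases}
$$

and $(v_{n-1}^{s',1},\dots,v_{n-1}^{s',m}) = V_{G^C}(s',\star_1,\star_2,D,E,\theta,n{-}1)$ for all $s' \in S$.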
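Two ingredients of one value iteration step lend themselves to a short illustration: detecting which undecided coalitions have just satisfied or just failed their until objective in the current state, and deriving the utility of the per-state NFG from the previous iteration's values. The Python sketch below uses hypothetical names and data layouts (label sets as Python sets, values as lists indexed by coalition); it does not solve the NFG itself and is not the PRISM-games implementation.

```python
def newly_decided(labels, undecided, a1, a2):
    """Return the coalitions whose objective a1[l] U a2[l] is decided by `labels`.

    First set: coalitions whose goal a2[l] holds in the state.
    Second set: coalitions for which neither a1[l] nor a2[l] holds.
    """
    satisfied = {l for l in undecided if a2[l] in labels}
    failed = {l for l in undecided
              if a1[l] not in labels and a2[l] not in labels}
    return satisfied, failed


def utility(l, done, lost, successor_dist, prev_values):
    """Utility of coalition l for one joint action of the per-state NFG.

    successor_dist maps successor states to probabilities for that joint
    action; prev_values[t][l] is coalition l's value in state t from the
    previous iteration.
    """
    if l in done:
        return 1.0
    if l in lost:
        return 0.0
    return sum(p * prev_values[t][l] for t, p in successor_dist.items())
```

A solver for the chosen equilibrium type and optimality criterion would then be applied to the NFG whose utilities are given by `utility`; that step is deliberately left out here.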


Since this paper considers equilibria for any number of coalitions (in par- 
ticular, for more than two), the above follows the algorithm of [21] in the way 
that it keeps track of the coalitions that have satisfied their objective (D) or can 
no longer do so (E). By contrast, the CSG algorithm of [24] was limited to two
coalitions, which enabled the exploitation of efficient MDP analysis techniques
for such coalitions. As explained in [21], with more than two coalitions we cannot
reduce the analysis from an n-coalition game to an (n − 1)-coalition game, as
doing so would give one of the remaining coalitions additional power (the action
choices of the coalition that has satisfied its objective or can no longer do so)
and hence an advantage over the other coalitions.


Strategy synthesis. As in [24,21] we can extend the model checking algorithm 
to perform strategy synthesis, generating a witness (i.e., a profile or joint strat- 
egy) representing the corresponding optimal equilibrium. This is achieved by 
storing the profile or joint strategy for the NFG solved in each state. Both the 
profiles and joint strategies require finite memory and are probabilistic. Memory 
is required as choices change after a path formula becomes true or a target is 
reached and to keep track of the step bound in finite-horizon properties. Ran- 
domisation is required for both NE and CE of NFGs. 


Correctness and complexity. The correctness of the algorithm follows directly 
from [24,21], as changing the class of equilibria or optimality criterion does not 
change the proof. The complexity of the algorithm is linear in the formula size 
and value iteration requires finding optimal NE or CE for an NFG in each state 
of the model. Computing NEs of an NFG with two (or more) players is PPAD- 
complete [12,11], while finding optimal CEs of an NFG is in P [15]. 
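One way to see why optimal CEs are tractable is that the CE conditions are linear in the joint distribution, so an optimal CE can be found by linear programming. The sketch below only checks those linear incentive constraints for a given distribution, using the standard textbook definition of a correlated equilibrium; the function name and the "chicken" payoffs are illustrative assumptions and do not come from the paper.

```python
from fractions import Fraction
from itertools import product


def is_correlated_equilibrium(actions, utility, dist):
    """Check the CE incentive constraints for a joint distribution.

    actions[i] lists player i's actions, utility[joint] is a tuple of
    payoffs and dist[joint] is the probability of recommending `joint`.
    No player may gain in expectation by replacing a recommended action
    `rec` with another action `dev`.
    """
    for i in range(len(actions)):
        for rec, dev in product(actions[i], repeat=2):
            if rec == dev:
                continue
            gain = Fraction(0)
            for joint, p in dist.items():
                if joint[i] != rec:
                    continue
                deviated = joint[:i] + (dev,) + joint[i + 1:]
                gain += p * (utility[deviated][i] - utility[joint][i])
            if gain > 0:
                return False
    return True


# Illustrative two-player "chicken" game (C = cautious, D = dare).
actions = [["C", "D"], ["C", "D"]]
utility = {("C", "C"): (6, 6), ("C", "D"): (2, 7),
           ("D", "C"): (7, 2), ("D", "D"): (0, 0)}
third = Fraction(1, 3)
signal = {("C", "C"): third, ("C", "D"): third, ("D", "C"): third}
uniform = {joint: Fraction(1, 4) for joint in utility}
```

Here `signal` passes the check, whereas `uniform` does not, since a player told to dare would prefer to be cautious.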


4 Case Studies and Experimental Results 


We have developed an implementation of our techniques for equilibria synthe- 
sis on CSGs, described above, building on top of the PRISM-games [22] model 
checker. Our implementation extends the tool's existing support for construction 
and analysis of CSGs, which is contained within its sparse matrix based “explicit” 
engine written in Java. We have considered a range of CSG case studies (supple- 
mentary material can be found at [40]). Below, we summarise the efficiency and 
scalability of our approach, again running on a 2.10GHz Intel Xeon Gold with 
32GB JVM memory, and then describe our findings on individual case studies. 


Efficiency and scalability. Table 2 summarises the performance of our imple- 
mentation on the case studies that we have considered. It shows the statistics for 
each CSG, and the time taken to build it and perform equilibria synthesis, for 
several different variants (NE vs. CE, SW vs. SF). Comparing the efficiency of 
synthesising SWNE and SWCE, we see that the latter is typically much faster. 
For two-player NE, the social fairness variant is no more expensive to compute as 
we enumerate all NEs. For CE, SF computation uses Z3 rather than Gurobi;
although Z3 is able to find optimal equilibria, it is not primarily
developed as an optimisation suite and therefore generally performs poorly in
comparison with Gurobi. The benefits of the socially fair equilibria, in terms of
the values yielded for individual players, are discussed in the in-depth coverage
of the different case studies below.


Aloha. In this case study, introduced in [24], a number of users try to send 
packets using the slotted Aloha protocol. We suppose that each user has one 
packet to send and, in a time slot, if k users try and send their packet, then 
the probability that each packet is successfully sent is $q/k$ where $q \in [0,1]$. If a
user fails to send a packet, then the number of slots it waits before resending 
the packet is set according to Aloha’s exponential backoff scheme. The scheme 
requires that each user maintains a backoff counter, which it increases each time

Case study & property [parameters] | Players | ($\star_1$,$\star_2$) | Param. values | CSG states | CSG trans. | Constr. time (s) | Verif. time (s)
NE,SW 2.2 
CE,SW 2.1 
Aloka 2 (ese | 408 2,778 6,285 0.1 21 
Qa 2) min 2? (R"*[F s; ]) CE,SF 23.3 
[bras q] 3 [OSW] 468 | 107,799| 355,734) 3.0) ,50-2 
CE,SF 114.6 
NE,SW 1042.9 
4 CE,SW 2,0.8 68,689| 161,904 1.9 58.8 
Aloh NE,SW 1027.5 
(*1,*2) max e Paon fF siAt<D}) ‘ bd ——— ore B 4 AUR 
CE,SW ,936. 
[bmaz, q, D] 5 Oesp | 20-8,8 1,797,742/5,236,055| 54.5 TO 
Power control NE,SW 564.5 
ee ee ees, 2  |NEF|8,40,0.20 32,812} 260,924 1.2) 566.3 
, ax =? i nuu pe 
[DO wmaz , Cmax s Tail] CES 4 4 TD 
3 CE,SF 5,15,0.2 2,156| 740,758 3.5 TO 
Public good 8 ened 25,3 16,202| 35,884 0.8 E 
(61,2) max =2(R°[I="™ ]) AD 71.9 
[fs reos] 4 [Ceswi 33 | 391,961) 923,401] 13.0) 353 
5 CE,SW 4,2 59,294} 118,342 3.1 5.2 
Investors 2 [OSW] 998 | 71,731] 315,804) 24 a 
(+1,*2)max =? (R? [F cin; ]) UA eme 
[Poar; months] 3 ps 0.2,5 83,081| 462,920 8| ula 


Table 2: Statistics for a set of CSG verification instances (timeout 2 hours). 


there is a packet failure (up to $b_{max}$) and, if the counter equals $k$ and a failure
occurs, randomly chooses the slots to wait from $\{0,1,\dots,2^k-1\}$.
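The two numeric ingredients of the Aloha model, the per-packet success probability $q/k$ and the set of waiting times for a given backoff counter, can be written out as a small Python sketch. The function names are hypothetical, and whether the counter is incremented before or after the waiting time is drawn is not settled by the description above, so the three helpers are kept separate.

```python
from fractions import Fraction


def success_probability(q, k):
    """Probability that each packet is sent when k users transmit in one slot."""
    if k < 1:
        raise ValueError("at least one user must be sending")
    return q / k


def backoff_slots(counter):
    """Possible numbers of slots to wait for a given backoff counter value."""
    return list(range(2 ** counter))


def bump_counter(counter, b_max):
    """Increase the backoff counter after a failure, capped at b_max."""
    return min(counter + 1, b_max)
```

For example, with `q = Fraction(4, 5)` and two simultaneous senders, each packet goes through with probability 2/5.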


We suppose that the objective of each user is to minimise the expected
time to send their packet, which is represented by the nonzero-sum formula
$\langle\!\langle usr_1{:}\cdots{:}usr_m \rangle\!\rangle(\star_1,\star_2)_{\min=?}(\mathsf{R}^{time}[\mathsf{F}\,s_1]+\cdots+\mathsf{R}^{time}[\mathsf{F}\,s_m])$. Synthesising
optimal strategies for this specification, we find that the cases for SWNE and SWCE
coincide (although SWCE returns a joint strategy for the players, this joint strat- 
egy can be separated to form a strategy profile). This profile requires one user 
to try and send first, and then for the remaining users to take turns to try and 
send afterwards. If a user fails to send, then they enter backoff and allow all 
remaining users to try and send before trying to send again. There is no gain to 
a user in trying to send at the same time as another, as this will increase the 
probability of a sending failure, and therefore the user having to spend time in 
backoff before getting to try again. For SFNE, which has only been implemented 
for the two-player case, the two users follow identical strategies, which involve 
randomly deciding whether to wait or transmit, unless they are the only user 
that has not transmitted, and then they always try to send when not in backoff. 
In the case of SFCE, users can employ a shared probabilistic signal to coordinate 
which user sends next. Initially, this is a uniform choice over the users, but as 
time progresses the signal favours the users with lower backoff counters as these 
users have had fewer opportunities to send their packet previously. 


In Figure 2 we have plotted the optimal values for the players, where $SW_i$
corresponds to the optimal value (expected time to send their packet) for player
$i$ under both SWNE and SWCE, for the cases of two, three and four users. We see
that the optimal values for the different users under SFNE and SFCE coincide,
while under SWNE and SWCE they are different for each user (with the user
sending first having the lowest and the user sending last the highest). Comparing
the sum of the SWNE (and SWCE) values and that of the SFCE values, we see
a small decrease in the sum of less than 2% of the total, while for SFNE there
is a greater difference as the players cannot coordinate, and hence try and send
at the same time.

Fig. 2: Aloha: $\langle\!\langle usr_1{:}\cdots{:}usr_m \rangle\!\rangle(\star_1,\star_2)_{\min=?}(\mathsf{R}^{time}[\mathsf{F}\,s_1]+\cdots+\mathsf{R}^{time}[\mathsf{F}\,s_m])$ (panels: two users, three users, four users; vertical axis: expected time)


Power control. This case study is based on a model of power control in
cellular networks from [7]. In the network there are a number of users that each
have a mobile phone. The phones emit signals that the users can strengthen by
increasing the phone's power level up to a bound ($pow_{max}$). A stronger signal
can improve transmission quality, but uses more energy and lowers the quality
of the transmissions of other phones due to interference. We use the extended
model from [22], which adds a probability of failure ($q_{fail}$) when a power
level is increased and assumes each phone has a limited battery capacity ($e_{max}$).
There is a reward structure associated with each phone representing transmission
quality, which is dependent on both the phone's power level and the power
levels of other phones due to interference. We consider the nonzero-sum property
$\langle\!\langle p_1{:}\cdots{:}p_m \rangle\!\rangle(\star_1,\star_2)_{\max=?}(\mathsf{R}^{r_1}[\mathsf{F}\,e_1]+\cdots+\mathsf{R}^{r_m}[\mathsf{F}\,e_m])$, where each user tries
to maximise their expected reward before their phone's battery is depleted.

In Figure 3 we have presented the expected rewards of the players under 
the synthesised SWCE and SFCE joint strategies. When performing strategy 
synthesis, in the case of two users the SWNE and SWCE yield the same profile 
in which, when the users’ batteries are almost depleted, one user tries to increase 
their phone's power level and, if successful, in the next step, the second user then 
tries to increase their phone's power level. Since the first user's phone battery 
is depleted when the second tries to increase, this increase does not cause any 
interference. On the other hand, if the first user fails to increase their power 
level, then both users increase their battery levels. For the SFCE, the users 
can coordinate and flip a coin as to which user goes first: as demonstrated by 
Figure 3, this yields equal rewards for the users, unlike the SWCE. In the case of
three users, the SWNE and SWCE differ (we were only able to synthesise SWNE
for $pow_{max} = 2$, as for larger values the computation had not completed within
the timeout). Again, users take turns to try and increase their phone's power
level. However, here if the users are unsuccessful the SWCE can coordinate as to
which user goes next trying to increase their phone's battery level. Through this
coordination, the users' rewards can be increased as the battery level of at most
one phone increases at a time, which limits interference. On the other hand, for
the SWNE users must decide independently whether to increase their phone's
battery level and they each randomly decide whether to do so or not.

Fig. 4: Public good: $\langle\!\langle p_1{:}\cdots{:}p_m \rangle\!\rangle(\star_1,\star_2)_{\max=?}(\mathsf{R}^{c_1}[\mathsf{I}^{=r_{max}}]+\cdots+\mathsf{R}^{c_m}[\mathsf{I}^{=r_{max}}])$


Public good. We next consider a variant of a public good game [19], based
on the one presented in [22] for the two-player case. In this game a number
of players each receive an initial amount of capital ($e_{init}$) and, in each of $r_{max}$
months, can invest none, half or all of their current capital. The total invested
by the players in a month is multiplied by a factor $f$ and distributed equally
among the players before the start of the next month. The aim of the players
is to maximise their expected capital, which is represented by the formula:
$\langle\!\langle p_1{:}\cdots{:}p_m \rangle\!\rangle(\star_1,\star_2)_{\max=?}(\mathsf{R}^{c_1}[\mathsf{I}^{=r_{max}}]+\cdots+\mathsf{R}^{c_m}[\mathsf{I}^{=r_{max}}])$.
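The capital update for one month of the public good game can be sketched in a few lines of Python. The function name is hypothetical, and the sketch assumes (the description does not say so explicitly) that each player keeps whatever capital they did not invest.

```python
from fractions import Fraction


def play_month(capital, invest_fraction, f):
    """Capital of each player after one month.

    capital[i] is player i's current capital, invest_fraction[i] is 0,
    1/2 or 1, and f is the factor applied to the total investment, which
    is then shared equally. Assumption: uninvested capital is kept.
    """
    invested = [c * x for c, x in zip(capital, invest_fraction)]
    share = f * sum(invested) / len(capital)
    return [c - inv + share for c, inv in zip(capital, invested)]


# Three players with capital 5 who invest all, half and nothing, with f = 2.
after = play_month(
    [Fraction(5), Fraction(5), Fraction(5)],
    [Fraction(1), Fraction(1, 2), Fraction(0)],
    Fraction(2),
)
```

In this example the total capital rises from 15 to 45/2, but the player who invested nothing ends the month with the most, which is the tension that the equilibria have to resolve.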

Figure 4 plots, for the three-player model, both the expected capital of indi- 
vidual players and the total expected capital after three months for the SWNE, 
SWCE and SFNE as the parameter f varies. As the results demonstrate the play- 
ers benefit, both as individuals and as a population, by coordinating through a 
correlated strategy. In addition, under the SFCE, all players receive the same 
expected capital with only a small decrease in the sum from that of the SWCE. 


Investors. The final case study concerns a concurrent multi-player version of
the futures market investor model of [26], in which a number of investors (the
players) interact with a probabilistic stock market. In successive months, the
investors choose whether to invest, wait or cash in their shares, while at the same
time the market decides with probability $p_{bar}$ to bar each investor, with the
restriction that an investor cannot be barred two months in a row or in the first
month, and then the values of shares and cap on values are updated probabilistically.

We consider both two- and three-player models, where each investor tries to
maximise its individual profit represented by the following nonzero-sum property:
$\langle\!\langle inv_1{:}\cdots{:}inv_m \rangle\!\rangle(\star_1,\star_2)_{\max=?}(\mathsf{R}^{pf_1}[\mathsf{F}\,cin_1]+\cdots+\mathsf{R}^{pf_m}[\mathsf{F}\,cin_m])$. In Figure 5
we have plotted the different optimal values for NE and CE of the two-player
game and the different optimal values for CE of the three-player game (the
computation of NE values timed out for the three-player case). As the results
demonstrate, the coordination that CEs offer can again improve the
returns of the players and, although considering social fairness does decrease
the returns of some players, this is limited, particularly for CEs.

Fig. 5: Investors: $\langle\!\langle inv_1{:}\cdots{:}inv_m \rangle\!\rangle(\star_1,\star_2)_{\max=?}(\mathsf{R}^{pf_1}[\mathsf{F}\,cin_1]+\cdots+\mathsf{R}^{pf_m}[\mathsf{F}\,cin_m])$


5 Conclusions 


We have presented novel techniques for game-theoretic verification of proba- 
bilistic multi-agent systems, focusing on correlated equilibria and a notion of 
social fairness. We began with the simpler case of normal form games and then 
extended this to concurrent stochastic games, and used temporal logic to for- 
mally specify equilibria. We proposed algorithms for equilibrium synthesis, im- 
plemented them and illustrated their benefits, in terms of efficiency and fairness, 
on case studies from a range of application domains. 

Future work includes exploring the use of further game-theoretic topics within 
this area, such as techniques for mechanism design or other concepts such as 
Stackelberg equilibria. We plan to implement SFCE computation in Gurobi using 
the big-M method [16] to encode implications and techniques from [37] to encode 
conjunctions, which should yield a significant speed-up in their computation.


Abstract. We consider turn-based stochastic 2-player games on graphs
with ω-regular winning conditions. We provide a direct symbolic
algorithm for solving such games when the winning condition is formulated
as a Rabin condition. For a stochastic Rabin game with $k$ pairs over a
game graph with $n$ vertices, our algorithm runs in $O(n^{k+2}k!)$ symbolic
steps, which improves the state of the art.

We have implemented our symbolic algorithm, along with performance
optimizations including parallelization and acceleration, in a BDD-based
synthesis tool called Fairsyn. We demonstrate the superiority of Fairsyn 
compared to the state of the art on a set of synthetic benchmarks derived 
from the VLTS benchmark suite and on a control system benchmark from 
the literature. In our experiments, Fairsyn performed significantly faster 
with up to two orders of magnitude improvement in computation time. 


1 Introduction 


Symbolic algorithms for 2-player graph games are at the heart of many prob- 
lems in the automatic synthesis of correct-by-construction hardware, software, 
and cyber-physical systems from logical specifications. The problem has a 
rich pedigree, going back to Church [10] and a sequence of seminal results 
[6,31,17,30,13,14,34,21]. A chain of reductions can be used to reduce the
synthesis problem for ω-regular specifications to finding winning strategies in
2-player games on graphs, for which (symbolic) algorithms are known (see, e.g., 
[29,14,34,27]). These algorithms form the basis for algorithmic reactive synthesis. 

For systems under uncertainty, it is also essential to capture non-determinism 
quantitatively using probability distributions [5,18,22,25]. Turn-based
stochastic 2-player games [3,9], also known as 2½-player games, generalize 2-player
graph games with an additional category of "random" vertices: whenever the
game reaches a random vertex, a random process picks one of the outgoing
edges according to a probability distribution. The qualitative winning problem
asks whether a vertex of the game graph is almost surely winning for Player 0.
Stochastic Rabin games were studied by Chatterjee et al. [7], who showed that
the problem is NP-complete and that winning strategies can be restricted to
be pure (non-randomized) and memoryless. Moreover, they showed a reduction
from qualitative winning in an $n$-vertex $k$-pair stochastic Rabin game to
an $O(n(k+1))$-vertex $(k+1)$-pair (deterministic) Rabin game, resulting in an
$O((n(k+1))^{k+3}(k+1)!)$ algorithm. In contrast, we provide a direct $O(n^{k+2}k!)$
symbolic algorithm for the problem.

Our new direct symbolic algorithm is obtained in the following way. We 
replace the probabilistic transitions with transitions of the environment con- 
strained by extreme fairness as described by Pnueli [28]. Extreme fairness is 
specified via a special set of Player 1 vertices, called live vertices. A run is ex- 
tremely fair if whenever a live vertex is visited infinitely often, every outgoing 
edge from this vertex is taken infinitely often. As our first contribution, we show 
that to solve a qualitative stochastic Rabin game, we can equivalently solve a 
(deterministic) Rabin game over the same game graph by interpreting random 
vertices of the stochastic game as live vertices. 

As our second contribution we present a direct symbolic algorithm to solve
(deterministic) Rabin games with live vertices, which we call extremely fair
adversarial Rabin games. In particular, we show a surprisingly simple syntactic
transformation that modifies the well-known symbolic fixpoint algorithm for solving
2-player Rabin games on graphs (without live vertices), such that the modified
fixpoint solves the extremely fair adversarial version of the game.

To appreciate the simplicity of our modification, let us consider the
well-known fixpoint algorithms for Büchi and co-Büchi games (particular classes of
Rabin games), given by the following μ-calculus formulas:

Büchi: $\nu Y.\ \mu X.\ (G \cap \mathrm{Cpre}(Y)) \cup \mathrm{Cpre}(X)$,
Co-Büchi: $\mu X.\ \nu Y.\ (G \cup \mathrm{Cpre}(X)) \cap \mathrm{Cpre}(Y)$,

where $\mathrm{Cpre}(\cdot)$ denotes the controllable predecessor operator and $G$ denotes the
set of goal states that should be visited recurrently. In the presence of strong
transition fairness, the new algorithm becomes

Büchi: $\nu Y.\ \mu X.\ (G \cap \mathrm{Cpre}(Y)) \cup \mathrm{Apre}(Y, X)$,
Co-Büchi: $\nu W.\ \mu X.\ \nu Y.\ (G \cup \mathrm{Apre}(W, X)) \cap \mathrm{Cpre}(Y)$.

The only syntactic change (highlighted in blue) we make is to substitute the
controllable predecessor for the μ variable $X$ by a new almost sure predecessor
operator $\mathrm{Apre}(Y, X)$ incorporating also the previous ν variable $Y$; if the fixpoint
starts with a μ variable (with no previous ν variable), like for co-Büchi games,
we introduce one additional ν variable in the front. For the general class of
Rabin specifications, with a more involved fixpoint and with arbitrarily high
nesting depth depending on the number of Rabin pairs, we need to perform this
substitution for every such $\mathrm{Cpre}(\cdot)$ operator for every μ variable.

We prove the correctness of this syntactic fixpoint transformation for solv- 
ing Rabin games [31,27] in this paper. It can be shown that the same syntactic 
transformation may be used to obtain fixpoint algorithms for qualitative solution 
of stochastic games with other popular ω-regular objectives, namely Reachability,
Safety, (generalized) Büchi, (generalized) co-Büchi, Rabin-chain, parity, and
GR(1). Owing to page constraints, these additional fixpoints are only discussed
in the extended version [4] of this paper, where we also generalize all results 
presented in this paper to a weaker notion of fairness, called transition fairness. 
In a nutshell, these results show that one can solve games with live vertices 
while retaining the algorithmic characteristics and implementability of known 
symbolic fixpoint algorithms that do not consider fairness assumptions. 

We have implemented our symbolic algorithm for solving stochastic Rabin 
games in a symbolic BDD-based reactive synthesis tool called Fairsyn. Fairsyn 
additionally uses parallelization and a fixpoint acceleration technique [23] to
boost performance. We evaluate our tool on two case studies, one using synthetic 
benchmarks derived from the VLTS benchmark suite [15] and the other from 
controller synthesis for stochastic control systems [12]. We show that Fairsyn 
scales well on these case studies, and outperforms the state-of-the-art methods 
by up to two orders of magnitude. 

All the technical proofs, the fixpoints for various other specifications, and an 
additional benchmark taken from the software engineering literature [8] can be 
found in the extended version of this paper under a slightly more relaxed setting
of the problem (transition fairness instead of extreme fairness) [4]. 


2 Preliminaries 


Notation: We write $\mathbb{N}_0$ to denote the set of natural numbers including zero.
Given $a, b \in \mathbb{N}_0$, we write $[a;b]$ to denote the set $\{n \in \mathbb{N}_0 \mid a \leq n \leq b\}$. By
definition, $[a;b]$ is an empty set if $a > b$. For any set $A \subseteq U$ defined on the
universe $U$, we write $\overline{A}$ to denote the complement of $A$. Given an alphabet $A$,
we use the notation $A^*$ and $A^\omega$ to denote respectively the set of all finite words
and the set of all infinite words formed using the letters of the alphabet $A$. Let
$A$ and $B$ be two sets and $R \subseteq A \times B$ be a relation. For any element $a \in A$, we
use the notation $R(a)$ to denote the set $\{b \in B \mid (a,b) \in R\}$.


2½-player game graph: We consider usual turn-based stochastic games, also 
known as 2½-player games, played between Player 0, Player 1, and a third player 
representing environmental randomness which is treated as a "half player." 
Formally, a 2½-player game graph is a tuple 
$G = \langle V, V_0, V_1, V_r, E \rangle$ where (i) $V$ is a finite set of 
vertices, (ii) $V_0$, $V_1$, and $V_r$ are subsets of $V$ which form a partition 
of $V$, and (iii) $E \subseteq V \times V$ is the set of directed edges. The 
vertices in $V_r$ are called random vertices, and the edges originating in a 
random vertex are called random edges, denoted as $E_r$. A 2½-player game graph 
with no random vertices (i.e., $V_r = \emptyset$) is called a 2-player game graph. 
A 2½-player game graph with $V_1 = \emptyset$ is called a 1½-player game graph 
(also known as a Markov Decision Process or MDP). A 2½-player game graph with 
$V = V_r$ is known as a Markov chain. 
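
As an illustration of the definition above, the following Python sketch stores a 
2½-player game graph as explicit sets and names the special case it falls into. 
The class `GameGraph`, its field names, and the method `kind` are hypothetical 
names chosen for this sketch; they are not part of the paper or of Fairsyn. 

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GameGraph:
    """Explicit-set encoding of G = <V, V_0, V_1, V_r, E> (illustrative only)."""
    v0: frozenset      # Player 0 vertices
    v1: frozenset      # Player 1 vertices
    vr: frozenset      # random vertices
    edges: frozenset   # directed edges as (source, target) pairs

    def __post_init__(self):
        # V_0, V_1 and V_r must be pairwise disjoint so that they partition V.
        if self.v0 & self.v1 or self.v0 & self.vr or self.v1 & self.vr:
            raise ValueError("V_0, V_1, V_r must be pairwise disjoint")
        for (u, w) in self.edges:
            if u not in self.vertices or w not in self.vertices:
                raise ValueError("edge endpoint outside V")

    @property
    def vertices(self):
        return self.v0 | self.v1 | self.vr

    def successors(self, v):
        """E(v): the set of successors of v."""
        return {w for (u, w) in self.edges if u == v}

    def random_edges(self):
        """E_r: the edges originating in a random vertex."""
        return {(u, w) for (u, w) in self.edges if u in self.vr}

    def kind(self):
        """Name the special case of the definition that this graph falls into."""
        if not self.v0 and not self.v1:
            return "Markov chain"
        if not self.vr:
            return "2-player game graph"
        if not self.v1:
            return "1.5-player game graph (MDP)"
        return "2.5-player game graph"
```

The Markov chain case is tested first because a graph with $V = V_r$ also has 
$V_1 = \emptyset$ and would otherwise be reported as an MDP. 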


Strategies: A (deterministic) strategy of Player 0 is a function 
$\rho_0 : V^*V_0 \to V$ with $\rho_0(wv) \in E(v)$ for every $wv \in V^*V_0$. 
Likewise, a strategy of Player 1 is a function $\rho_1 : V^*V_1 \to V$ with 
$\rho_1(wv) \in E(v)$ for every $wv \in V^*V_1$. We denote the set of strategies 
of Player $i$ by $\Pi_i$. A strategy $\rho_i$ of Player $i$ ($i \in \{0,1\}$) is 
memoryless if for every $w_1v, w_2v \in V^*V_i$, we have 
$\rho_i(w_1v) = \rho_i(w_2v)$. In this paper we restrict attention to 
deterministic strategies, as randomized strategies are no more powerful than 
deterministic ones for 2½-player Rabin games [7]. 


Plays: Consider an infinite sequence of vertices† 
$\pi = v^0v^1v^2\ldots \in V^\omega$. The sequence $\pi$ is called a play over $G$ 
starting at the vertex $v^0$ if for every $i \in \mathbb{N}_0$, we have 
$v^i \in V$ and $(v^i, v^{i+1}) \in E$. A play is finite if it is of the form 
$v^0v^1\ldots v^n$ for some finite $n \in \mathbb{N}_0$. Let $\rho_0 \in \Pi_0$ 
and $\rho_1 \in \Pi_1$ be a pair of strategies for the two players, and 
$v^0 \in V$ be a given initial vertex. For every finite play 
$\pi = v^0v^1\ldots v^n$, the next vertex $v^{n+1}$ is obtained as follows: if 
$v^n \in V_0$ then $v^{n+1} = \rho_0(v^0\ldots v^n)$; if $v^n \in V_1$ then 
$v^{n+1} = \rho_1(v^0\ldots v^n)$; and if $v^n \in V_r$ then $v^{n+1}$ is chosen 
uniformly at random from the set $E_r(v^n)$. The uniform probability distribution 
over the random edges is without loss of generality for the problem considered in 
this paper; we will come back to this after setting up the problem statement. 
Every play generated in this way by fixing $\rho_0$, $\rho_1$, and $v^0$ is called 
a play compliant with $\rho_0$ and $\rho_1$ that starts at vertex $v^0$. The 
random choice in the random vertices induces a probability measure 
$P^{\rho_0,\rho_1}_{v^0}$ on the sample space of plays.‡ This is in contrast to 
2-player games, where for any choice of $\rho_0 \in \Pi_0$, $\rho_1 \in \Pi_1$, 
and $v^0 \in V$, the resulting compliant play is unique. 
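
The way a compliant play is generated can be sketched as follows. This is a 
minimal illustration that assumes memoryless strategies given as dictionaries; 
the function name `simulate_play`, the example graph `EDGES`, and the strategies 
are hypothetical and only serve to make the three cases of the definition 
concrete. 

```python
import random


def simulate_play(edges, v0, v1, vr, rho0, rho1, start, steps, rng):
    """Generate the first `steps` moves of a play compliant with rho0 and rho1.

    edges maps each vertex v to the set E(v); rho0 and rho1 are memoryless
    strategies given as dicts from a vertex to the chosen successor.
    """
    play = [start]
    for _ in range(steps):
        v = play[-1]
        if v in v0:
            nxt = rho0[v]
        elif v in v1:
            nxt = rho1[v]
        elif v in vr:
            # Random vertex: uniform choice among the successors E_r(v).
            nxt = rng.choice(sorted(edges[v]))
        else:
            raise ValueError(f"vertex {v} belongs to no player")
        if nxt not in edges[v]:
            raise ValueError(f"({v}, {nxt}) is not an edge")
        play.append(nxt)
    return play


# Hypothetical example: 0 is a Player 0 vertex, 1 a Player 1 vertex, 2 random.
EDGES = {0: {1, 2}, 1: {0}, 2: {0, 2}}
play = simulate_play(EDGES, {0}, {1}, {2}, {0: 2}, {1: 0}, 0, 20,
                     random.Random(0))
```

Only a finite prefix is produced here; the probability measure on infinite plays 
is not approximated by this sketch. 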


Winning Conditions: A winning condition $\varphi$ is a set of infinite plays over 
$G$, i.e., $\varphi \subseteq V^\omega$, where the game graph $G$ will always be 
clear from the context. We adopt Linear Temporal Logic (LTL) notation for 
describing winning conditions. The atomic propositions for the LTL formulas are 
sets of vertices, i.e., elements of the set $2^V$. We use the standard symbols for 
the Boolean and the temporal operators: "$\neg$" for negation, "$\wedge$" for 
conjunction, "$\vee$" for disjunction, "$\to$" for implication, "$\mathcal{U}$" 
for until ($A\,\mathcal{U}\,B$ means "the play remains inside the set $A$ until 
it moves to the set $B$"), "$\bigcirc$" for next ($\bigcirc A$ means "the next 
vertex is in the set $A$"), "$\Diamond$" for eventually ($\Diamond A$ means "the 
play will eventually visit a vertex from the set $A$"), and "$\Box$" for always 
($\Box A$ means "the play will only visit vertices from the set $A$"). The syntax 
and semantics of LTL can be found in standard textbooks [3]. By slightly abusing 
notation, we use $\varphi$ interchangeably to denote both the LTL formula and the 
set of plays satisfying $\varphi$. Hence, we write $\pi \in \varphi$ to denote the 
satisfaction of the formula $\varphi$ by the play $\pi$. 


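Rabin Winning Conditions: A Rabin winning condition is expressed using a set of 
$k$ Rabin pairs 
$\mathcal{R} = \{\langle G_1, R_1\rangle, \ldots, \langle G_k, R_k\rangle\}$, 
where $k$ is any positive integer and $G_i, R_i \subseteq V$ for all 
$i \in [1;k]$. We say that $\mathcal{R}$ has the index set $P = [1;k]$. A play 
$\pi$ satisfies the Rabin condition $\mathcal{R}$ if $\pi$ satisfies the LTL 
formula 


$$\varphi := \bigvee_{i \in P} \left(\Diamond\Box\overline{R}_i \wedge \Box\Diamond G_i\right). \qquad (2)$$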
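
For an ultimately periodic play of the form prefix · cycle$^\omega$, formula (2) 
can be checked directly: the vertices visited infinitely often are exactly those 
on the cycle, so the pair $\langle G_i, R_i\rangle$ is satisfied iff the cycle 
avoids $R_i$ and meets $G_i$. The Python sketch below encodes this check. The 
restriction to ultimately periodic plays and the name `satisfies_rabin` are 
assumptions of the sketch, not part of the paper. 

```python
def satisfies_rabin(cycle, pairs):
    """Check the Rabin condition (2) on a play of the form prefix . cycle^omega.

    cycle is the non-empty list of vertices repeated forever;
    pairs is a list of (G_i, R_i) with G_i and R_i given as sets of vertices.
    """
    inf = set(cycle)  # vertices visited infinitely often
    # Some pair must have R_i visited only finitely often and G_i infinitely often.
    return any(not (inf & r) and bool(inf & g) for (g, r) in pairs)
```

The finite prefix plays no role, which reflects that Rabin conditions only 
constrain what happens infinitely often. 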


Almost Sure Winning: Let $G$ be a 2½-player game graph, $\rho_0 \in \Pi_0$ and 
$\rho_1 \in \Pi_1$ be a pair of strategies, $v^0 \in V$ be an initial vertex, and 
$\varphi$ be an $\omega$-regular specification over the vertices of $G$. Then 
$P^{\rho_0,\rho_1}_{v^0}(\varphi)$ denotes the probability of satisfaction of 
$\varphi$ by the plays compliant with $\rho_0$ and $\rho_1$ and starting at $v^0$. 
The set of almost sure winning states of Player 0 for the specification $\varphi$ 
is defined as the set $W^{a.s.} \subseteq V$ such that for every 
$v^0 \in W^{a.s.}$ the following holds: 
$\sup_{\rho_0 \in \Pi_0} \inf_{\rho_1 \in \Pi_1} P^{\rho_0,\rho_1}_{v^0}(\varphi) = 1$. 
It is known [7, Thm. 4] that there is an optimal (deterministic) memoryless 
strategy $\rho_0^* \in \Pi_0$, called the optimal almost sure winning strategy, 
such that for every $v^0 \in W^{a.s.}$ it holds that 
$\inf_{\rho_1 \in \Pi_1} P^{\rho_0^*,\rho_1}_{v^0}(\varphi) = 1$. 


† In our convention for denoting vertices, superscripts (ranging over 
$\mathbb{N}_0$) will denote the position of a vertex within a given sequence/play, 
whereas subscripts, either 0, 1, or r, will denote the membership of a vertex in 
the sets $V_0$, $V_1$, or $V_r$ respectively. 

‡ The unique measure $P^{\rho_0,\rho_1}_{v^0}$ is obtained through Carathéodory's 
extension theorem by extending the pre-measure on every infinite extension 
(called the cylinder set) of every finite play; see [3, pp. 757] for details. 

We extend the notion of winning to 2-player games as follows. Fix a 2-player 
game graph $G = \langle V, V_0, V_1, \emptyset, E\rangle$ and an $\omega$-regular 
specification $\varphi$ over $V$. Player 0 wins the game from a vertex 
$v^0 \in V$ if Player 0 has a strategy $\rho_0$ such that for every Player 1 
strategy $\rho_1$, the unique resulting play starting at $v^0$ is in $\varphi$. 
The winning region $W \subseteq V$ is the set of vertices from which Player 0 wins 
the game. It is known that Player 0 has a memoryless strategy $\rho_0^*$, called 
the optimal winning strategy, such that for every Player 1 strategy 
$\rho_1 \in \Pi_1$ and for every initial vertex $v^0 \in W$, the resulting unique 
compliant play is in $\varphi$ [19]. 


3 Problem Statement and Outline 


Given a 2½-player game graph $G$ and a Rabin specification $\varphi$ as in (2), 
we consider the problem of solving the induced qualitative reactive synthesis 
problem. That is, we want to compute the set of almost sure winning states 
$W^{a.s.}$ of $G$ w.r.t. $\varphi$ and the corresponding optimal memoryless 
winning strategy $\rho_0^*$ of Player 0. This problem was solved by Chatterjee et 
al. [7] via a reduction from qualitative winning in the original 2½-player Rabin 
game to winning in a larger (deterministic) 2-player Rabin game with an 
additional Rabin pair. 

Instead of inflating the game graph and introducing an extra Rabin pair at 
the cost of more expensive computation, we propose a direct and computationally 
more efficient symbolic algorithm over the original game graph G. We get this 
algorithm by interpreting the random vertices of G as special Player 1 vertices, 
called live vertices, which are subject to an extreme fairness assumption: along 
every play, if a live vertex v is visited infinitely often, then all outgoing transitions 
of v are also taken infinitely often. This re-interpretation results in a 2-player 
Rabin game with special live Player 1 vertices that are subjected to extreme 
fairness assumptions on Player 1’s behavior. We call such games extremely fair 
adversarial (2-player) Rabin games. The correctness of our symbolic algorithm 
then follows from the two main results of our paper. 

(I) We show that qualitative winning in a 2½-player Rabin game $G$ is equivalent 
to winning in the extremely fair adversarial (2-player) Rabin game $G^\ell$ 
obtained from $G$. Moreover, the winning strategy $\rho_0$ of Player 0 in 
$G^\ell$ is also the optimal almost sure winning strategy in $G$ for $\varphi$ 
(see Thm. 1 in Sec. 4). 

(II) We give a direct symbolic algorithm to compute the set of winning states, 
along with the Player 0 winning strategy for extremely fair adversarial (2-player) 
Rabin games (see Thm. 2 in Sec. 5). 

Both contributions are discussed in detail in Sec. 4 and Sec. 5, respectively. 
Even though, for convenience, we have assumed a uniform probability distribution 
over the random edges, our contributions are valid for any arbitrary probability 
distribution. This follows from the established fact that the qualitative 
analysis of 2½-player games does not depend on the precise probability values 
but only on the supports of the distributions [7]. 

We conclude the paper with an experimental evaluation in Sec. 6. 


4 From Randomness to Extreme Fairness 


In this section, we show that qualitative winning in 2½-player Rabin games 
is equivalent to winning in extremely fair adversarial (2-player) Rabin games 
over the same underlying game graph. While it is known [16, Thm. 11.1] that 
the reduction of random vertices to extreme fairness is sound and complete 
for liveness winning conditions,§ we extend this connection to arbitrary Rabin 
winning conditions in this section, and therefore to the entire class of 
$\omega$-regular specifications. We start with a formal definition of extremely 
fair adversarial games and the connection between randomness and extreme 
fairness, before stating our main result in Thm. 1. 


Extremely Fair Adversarial Games: Let 
$G = \langle V, V_0, V_1, \emptyset, E\rangle$ be a 2-player game graph with live 
vertices $V^\ell \subseteq V_1$, denoted using the tuple 
$G^\ell = \langle G, V^\ell\rangle$. The edges originating from the live vertices 
are called the live edges, and their set is denoted as 
$E^\ell := (V^\ell \times V) \cap E$. A play $\pi$ over $G^\ell$ is extremely fair 
with respect to $V^\ell$ if it satisfies the following LTL formula: 


$$\alpha := \bigwedge_{v \in V^\ell}\ \bigwedge_{(v,v') \in E^\ell} \left(\Box\Diamond v \to \Box\Diamond(v \wedge \bigcirc v')\right). \qquad (3)$$


Given $G^\ell$ and an $\omega$-regular winning condition $\varphi$ over $V$, 
Player 0 wins the extremely fair adversarial game over $G^\ell$ for $\varphi$ 
from a vertex $v^0 \in V$ if Player 0 wins the game over $G$ for the winning 
condition $\alpha \to \varphi$ from $v^0$. 
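
On an ultimately periodic play prefix · cycle$^\omega$, formula (3) reduces to a 
finite check: every live vertex on the cycle must have all of its outgoing edges 
on the cycle. The Python sketch below encodes this, together with the implication 
$\alpha \to \varphi$. The restriction to ultimately periodic plays and the names 
`is_extremely_fair` and `wins_fair_game` are assumptions made for illustration. 

```python
def is_extremely_fair(cycle, edges, live):
    """Check formula (3) on a play of the form prefix . cycle^omega.

    edges maps each vertex v to the set E(v); live is the set of live vertices.
    """
    n = len(cycle)
    # The transitions taken infinitely often are exactly those on the cycle.
    taken = {(cycle[i], cycle[(i + 1) % n]) for i in range(n)}
    # Every live vertex seen infinitely often must take each outgoing edge.
    return all((v, w) in taken
               for v in set(cycle) & set(live)
               for w in edges[v])


def wins_fair_game(cycle, edges, live, phi):
    """Evaluate alpha -> phi on the play; phi is a predicate on the cycle."""
    return (not is_extremely_fair(cycle, edges, live)) or phi(cycle)
```

A play that loops forever in a live vertex with a second outgoing edge is not 
extremely fair, so it satisfies $\alpha \to \varphi$ vacuously. 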


Randomness as Extreme Fairness: Let $G = \langle V, V_0, V_1, V_r, E\rangle$ be a 
2½-player game graph. Then we say that $G$ induces the 2-player game graph with 
live vertices 
$G^\ell := \langle\langle V, V_0, V_1 \cup V_r, \emptyset, E\rangle, V_r\rangle$. 
Intuitively, we interpret every random vertex of $G$ as a live Player 1 vertex in 
$G^\ell$. This reinterpretation does not change the structure of the underlying 
graph specified by $V$ and $E$. 
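
The induced game graph can be written down in a few lines. In this Python sketch 
the function name `induce_live_game` and the tuple layout of its result are 
hypothetical; only the construction itself (random vertices become live Player 1 
vertices, everything else is unchanged) comes from the definition above. 

```python
def induce_live_game(v0, v1, vr, edges):
    """Reinterpret the random vertices of G as live Player 1 vertices.

    Returns (v0, v1_new, live, edges) with v1_new = V_1 | V_r and live = V_r.
    The vertex set and the edge relation are copied without modification.
    """
    return (set(v0),
            set(v1) | set(vr),
            set(vr),
            {v: set(ws) for v, ws in edges.items()})
```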


Soundness of the Reduction: It remains to show that the almost sure winning 
set and the optimal almost sure winning strategy of Player 0 in $G$ for $\varphi$ 
are the same as the winning state set and the winning strategy of Player 0 in 
$G^\ell$ for $\varphi$. This is formalized in the following theorem when 
$\varphi$ is given as a Rabin condition. The proof essentially shows that the 
random vertices of $G$ simulate the live vertices of $G^\ell$, and vice versa; 
details are in the extended version [4, App. B.6, pp. 61]. 


§ An LTL formula $\varphi$ over $V$ describes a liveness property if every finite 
play $\pi$ over $G$ allows for a continuation $\pi'$ s.t. $\pi\pi' \in \varphi$. 

Theorem 1. Let $G$ be a 2½-player game graph with vertex set $V$, 
$\varphi \subseteq V^\omega$ be a Rabin winning condition as in (2), and $G^\ell$ 
be the 2-player game graph with live edges induced by $G$. Let $W \subseteq V$ be 
the set of vertices from which Player 0 wins the extremely fair adversarial game 
over $G^\ell$ with respect to $\varphi$, and $W^{a.s.}$ be the almost sure 
winning set of Player 0 in the 2½-player game $G$ with respect to $\varphi$. 
Then, $W = W^{a.s.}$. Moreover, an optimal winning strategy in $G^\ell$ is also 
an optimal almost sure winning strategy in $G$, and vice versa. 


5 Extremely Fair Adversarial Rabin Games 


This section presents our main result, which is a symbolic fixpoint algorithm that 
computes the winning region of Player 0 in the extremely fair adversarial game 
over $G^\ell$ with respect to any $\omega$-regular property formalized as a Rabin 
winning condition. This new symbolic fixpoint algorithm has multiple unique 
features. 

(I) It works directly over $G^\ell$, without requiring any pre-processing step to 
reduce $G^\ell$ to a "normal" 2-player game with a larger set of vertices. 

(II) Our new fixpoint algorithm is obtained from the algorithm of Piterman et al. 
[27] by a simple syntactic change. We simply replace all controllable predecessor 
operators over least fixpoint variables by a new almost sure predecessor operator 
invoking the preceding maximal fixpoint variable. This makes the proof of our 
new fixpoint algorithm conceptually simple (see Sec. 5.3). 

At a higher level, we make a simple yet efficient syntactic transformation of 
the fixpoint to incorporate the fairness assumption on the live vertices, without 
introducing any extra computational complexity. Most remarkably, this 
transformation also works directly for fixpoint algorithms for reachability, 
safety, Büchi, (generalized) co-Büchi, Rabin-chain, and parity games, as these 
can be formalized as particular instances of a Rabin game. Moreover, it also 
works for generalized Rabin, generalized Büchi, and GR(1) games. Owing to page 
constraints, these additional cases are described in the extended version [4]. 


5.1 Preliminaries on Symbolic Computations over Game Graphs 


Set Transformers: Our goal is to develop symbolic fixpoint algorithms to 
characterize the winning region of an extremely fair adversarial game over a game 
graph with live edges. As a first step, given $G^\ell$, we define the required 
symbolic transformers of sets of states. We define the existential, universal, 
and controllable predecessor operators as follows. For $S \subseteq V$, we have 


$$\mathrm{Pre}^{\exists}_0(S) := \{v \in V_0 \mid E(v) \cap S \neq \emptyset\}, \qquad (4a)$$
$$\mathrm{Pre}^{\forall}_1(S) := \{v \in V_1 \mid E(v) \subseteq S\}, \text{ and} \qquad (4b)$$
$$\mathrm{Cpre}(S) := \mathrm{Pre}^{\exists}_0(S) \cup \mathrm{Pre}^{\forall}_1(S). \qquad (4c)$$


Intuitively, the controllable predecessor operator $\mathrm{Cpre}(S)$ computes the 
set of all states that can be controlled by Player 0 to stay in $S$ after one step 
regardless of the strategy of Player 1. Additionally, we define two operators 
which take advantage of the fairness assumption on the live vertices. Given two 
sets $S, T \subseteq V$, we define the live-existential and almost sure 
predecessor operators: 


$$\mathrm{Lpre}^{\exists}(S) := \{v \in V^\ell \mid E(v) \cap S \neq \emptyset\}, \text{ and} \qquad (5a)$$
$$\mathrm{Apre}(S,T) := \mathrm{Cpre}(T) \cup \left(\mathrm{Lpre}^{\exists}(T) \cap \mathrm{Pre}^{\forall}_1(S)\right). \qquad (5b)$$


Intuitively, the almost sure predecessor operator‖ $\mathrm{Apre}(S,T)$ computes 
the set of all states that can be controlled by Player 0 to stay in $T$ (via 
$\mathrm{Cpre}(T)$) as well as all Player 1 states in $V^\ell$ that (a) will 
eventually make progress towards $T$ if Player 1 obeys its fairness assumptions 
encoded in $\alpha$ (via $\mathrm{Lpre}^{\exists}(T)$) and (b) will never leave 
$S$ in the "meantime" (via $\mathrm{Pre}^{\forall}_1(S)$). All the used set 
transformers are monotonic with respect to set inclusion. Further, 
$\mathrm{Cpre}(T) \subseteq \mathrm{Apre}(S,T)$ always holds, 
$\mathrm{Cpre}(T) = \mathrm{Apre}(S,T)$ if $V^\ell = \emptyset$, and 
$\mathrm{Apre}(S,T) \subseteq \mathrm{Cpre}(S)$ if $T \subseteq S$. 
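
The operators (4a)-(5b) translate directly into set comprehensions. The following 
Python sketch works on explicitly enumerated vertex sets rather than on BDDs, so 
it illustrates the definitions only and says nothing about how Fairsyn implements 
them. The class name `LiveGame` and its method names are hypothetical. 

```python
class LiveGame:
    """A 2-player game graph with live vertices, stored as explicit sets."""

    def __init__(self, edges, v0, v1, live):
        self.edges = {v: set(ws) for v, ws in edges.items()}  # v -> E(v)
        self.v0, self.v1, self.live = set(v0), set(v1), set(live)
        if not self.live <= self.v1:
            raise ValueError("live vertices must be Player 1 vertices")

    def pre_exists_0(self, s):
        """(4a): Player 0 vertices with some successor in s."""
        return {v for v in self.v0 if self.edges[v] & set(s)}

    def pre_forall_1(self, s):
        """(4b): Player 1 vertices with all successors in s."""
        return {v for v in self.v1 if self.edges[v] <= set(s)}

    def cpre(self, s):
        """(4c): controllable predecessor."""
        return self.pre_exists_0(s) | self.pre_forall_1(s)

    def lpre_exists(self, s):
        """(5a): live vertices with some successor in s."""
        return {v for v in self.live if self.edges[v] & set(s)}

    def apre(self, s, t):
        """(5b): almost sure predecessor."""
        return self.cpre(t) | (self.lpre_exists(t) & self.pre_forall_1(s))


# Hypothetical example: 0 and 2 are live Player 1 vertices, 1 and 3 belong to
# Player 0; vertex 1 plays the role of a target and vertex 3 of an unsafe sink.
GAME = LiveGame({0: {0, 1}, 1: {1}, 2: {1, 3}, 3: {3}},
                v0={1, 3}, v1={0, 2}, live={0, 2})
```

In this example `GAME.cpre({1})` contains only vertex 1, because Player 1 may 
loop in 0 or move from 2 to 3, whereas `GAME.apre({0, 1, 2}, {1})` additionally 
contains the live vertex 0, whose successors all stay inside the first argument. 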

Fixpoint Algorithms in the μ-calculus: We use the μ-calculus [20] as a 
convenient logical notation to define a symbolic algorithm (i.e., an algorithm 
that manipulates sets of states rather than individual states) for computing a 
set of states with a particular property over a given game graph $G$. The 
formulas of the μ-calculus, interpreted over a 2-player game graph $G$, are given 
by the grammar 


$$\varphi ::= p \mid X \mid \varphi \cup \varphi \mid \varphi \cap \varphi \mid \mathit{pre}(\varphi) \mid \mu X.\varphi \mid \nu X.\varphi$$


where $p$ ranges over subsets of $V$, $X$ ranges over a set of formal variables, 
$\mathit{pre}$ ranges over monotone set transformers in 
$\{\mathrm{Pre}^{\exists}_0, \mathrm{Pre}^{\forall}_1, \mathrm{Cpre}, \mathrm{Lpre}^{\exists}, \mathrm{Apre}\}$, 
and $\mu$ and $\nu$ denote, respectively, the least and the greatest fixed point 
of the functional defined as $X \mapsto \varphi(X)$. Since the operations $\cup$, 
$\cap$, and the set transformers $\mathit{pre}$ are all monotonic, the fixed 
points are guaranteed to exist. A μ-calculus formula evaluates to a set of states 
over $G$, and the set can be computed by induction over the structure of the 
formula, where the fixed points are evaluated by iteration. 
We omit the (standard) semantics of formulas (see [20]). 
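
Evaluation of fixed points by iteration can be sketched as follows: a least fixed 
point is reached by iterating from the empty set, a greatest fixed point by 
iterating from the full vertex set. The helper names `lfp` and `gfp`, the example 
graph, and the plain existential predecessor `pre` used here are assumptions of 
this Python sketch and are not operators from the paper. 

```python
def lfp(f):
    """Least fixed point of a monotone f on finite sets, iterating from {}."""
    x = frozenset()
    while True:
        nxt = frozenset(f(x))
        if nxt == x:
            return set(x)
        x = nxt


def gfp(f, universe):
    """Greatest fixed point of a monotone f, iterating from the universe."""
    y = frozenset(universe)
    while True:
        nxt = frozenset(f(y))
        if nxt == y:
            return set(y)
        y = nxt


# Hypothetical example graph: 0 -> 1 -> 2 -> 2 and 3 -> 3.
EDGES = {0: {1}, 1: {2}, 2: {2}, 3: {3}}


def pre(s):
    """Vertices with at least one successor in s."""
    return {v for v, ws in EDGES.items() if ws & s}


# mu X. {2} | pre(X): vertices from which vertex 2 is reachable.
reach = lfp(lambda x: {2} | pre(x))
# nu Y. {0, 1, 3} & pre(Y): vertices that can stay inside {0, 1, 3} forever.
stay = gfp(lambda y: {0, 1, 3} & pre(y), EDGES)
```

Nested formulas such as $\nu Y.\ \mu X.\ \ldots$ are evaluated by calling `lfp` 
inside the function passed to `gfp`. 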


5.2 The Symbolic Algorithm 


We now present our new symbolic fixpoint algorithm to compute the winning 
region of Player 0 in the extremely fair adversarial game over $G^\ell$ with 
respect to a Rabin winning condition $\mathcal{R}$. A detailed correctness proof 
can be found in the extended version [4, App. B.3, pp. 40]. 


Theorem 2. Let $G^\ell = \langle G, V^\ell\rangle$ be a game graph with live edges 
and $\mathcal{R}$ be a Rabin condition over $G$ with index set $P = [1;k]$. 
Further, let $Z^*$ denote the fixed point of the following μ-calculus expression: 


$$\nu Y_{p_0}.\ \mu X_{p_0}.\ \bigcup_{p_1 \in P} \nu Y_{p_1}.\ \mu X_{p_1}.\ \bigcup_{p_2 \in P_{\setminus 1}} \nu Y_{p_2}.\ \mu X_{p_2}.\ \cdots \bigcup_{p_k \in P_{\setminus k-1}} \nu Y_{p_k}.\ \mu X_{p_k}.\ \left[\bigcup_{j=0}^{k} C_{p_j}\right], \qquad (6a)$$

$$\text{where } C_{p_j} := \left(\bigcap_{i=0}^{j} \overline{R}_{p_i}\right) \cap \left[\left(G_{p_j} \cap \mathrm{Cpre}(Y_{p_j})\right) \cup \mathrm{Apre}(Y_{p_j}, X_{p_j})\right], \qquad (6b)$$


with¶ $p_0 = 0$, $G_{p_0} := \emptyset$ and $R_{p_0} := \emptyset$ as well as 
$P_{\setminus i} := P \setminus \{p_1, \ldots, p_i\}$. Then $Z^*$ is equivalent to 
the winning region $W$ of Player 0 in the extremely fair adversarial game over 
$G^\ell$ for the winning condition $\varphi$ in (2). Moreover, the fixpoint 
algorithm runs in $O(n^{k+2}k!)$ symbolic steps, and a memoryless winning strategy 
for Player 0 can be extracted from it. 


‖ We will justify the naming of the Apre operator later in Rem. 1. 
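
To make the nesting in (6a) and (6b) concrete, the following Python sketch 
evaluates the expression by naive iteration over explicit vertex sets. It is an 
illustration under strong simplifications: it recomputes every inner fixpoint 
from scratch, uses neither BDDs nor the acceleration of [23], and therefore does 
not meet the stated bound on symbolic steps. The function name 
`solve_fair_rabin` and the encoding of the artificial index $p_0$ as `None` are 
assumptions of the sketch. 

```python
def solve_fair_rabin(edges, v0, v1, live, pairs):
    """Naive explicit-set evaluation of (6); pairs is a list of (G_i, R_i)."""
    vertices = set(edges)
    k = len(pairs)

    def cpre(s):
        return ({v for v in v0 if edges[v] & s}
                | {v for v in v1 if edges[v] <= s})

    def apre(s, t):
        return cpre(t) | ({v for v in live if edges[v] & t}
                          & {v for v in v1 if edges[v] <= s})

    def c_terms(seq, ys, xs):
        # Union of C_{p_j} for j = 0..k along the full index sequence seq.
        result, q = set(), set(vertices)
        for p, y, x in zip(seq, ys, xs):
            g, r = (set(), set()) if p is None else pairs[p]
            q = q - set(r)  # running intersection of the complements of R_{p_i}
            result |= q & ((set(g) & cpre(y)) | apre(y, x))
        return result

    def level(seq, ys, xs):
        # Evaluates  nu Y_{p_j}. mu X_{p_j}. (...)  for the last index in seq.
        y = set(vertices)
        while True:
            x = set()
            while True:
                if len(seq) == k + 1:       # all Rabin pairs have been chosen
                    nxt = c_terms(seq, ys + [y], xs + [x])
                else:                       # union over the remaining indices
                    nxt = set()
                    for p in range(k):
                        if p not in seq:
                            nxt |= level(seq + [p], ys + [y], xs + [x])
                if nxt == x:
                    break
                x = nxt
            if x == y:
                return y
            y = x

    return level([None], [], [])


# Hypothetical example: 0 and 2 are live Player 1 vertices, 1 and 3 belong to
# Player 0, and the single Rabin pair is G_1 = {1}, R_1 = {3}.
EDGES = {0: {0, 1}, 1: {0}, 2: {1, 3}, 3: {3}}
fair = solve_fair_rabin(EDGES, {1, 3}, {0, 2}, {0, 2}, [({1}, {3})])
plain = solve_fair_rabin(EDGES, {1, 3}, {0, 2}, set(), [({1}, {3})])
```

In this example `fair` is `{0, 1}`: fairness at the live vertex 0 forces the 
edge to vertex 1 to be taken again and again. Without live vertices, `plain` is 
empty because Player 1 can loop in vertex 0 forever. 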


5.3 Proof Outline 


Given a Rabin winning condition over a "normal" 2-player game, Piterman et al. 
[27] provided a symbolic fixpoint algorithm which computes the winning region for 
Player 0. The fixpoint algorithm in their paper is almost identical to our 
fixpoint algorithm in (6): it only differs in the last term of the constructed 
C-terms in (6b). In [27], the term $C_{p_j}$ is defined as 


$$\left(\bigcap_{i=0}^{j} \overline{R}_{p_i}\right) \cap \left[\left(G_{p_j} \cap \mathrm{Cpre}(Y_{p_j})\right) \cup \mathrm{Cpre}(X_{p_j})\right].$$


Intuitively, a single term $C_{p_j}$ computes the set of states that always 
remain within $Q_{p_j} := \bigcap_{i=0}^{j} \overline{R}_{p_i}$ while always 
re-visiting $G_{p_j}$. That is, given the simpler (local) winning condition 

$$\psi := \Box Q \wedge \Box\Diamond G \qquad (7)$$

for two sets $Q, G \subseteq V$, the set 

$$\nu Y.\ \mu X.\ Q \cap \left[(G \cap \mathrm{Cpre}(Y)) \cup \mathrm{Cpre}(X)\right] \qquad (8)$$


is known to define exactly the states of a "normal" 2-player game $G$ from which 
Player 0 has a strategy to win the game with winning condition $\psi$ [26]. Such 
games are typically called Safe Büchi Games. The key insight in the proof of 
Thm. 2 is to show that the new definition of C-terms in (6b) via the new almost 
sure predecessor operator Apre actually computes the winning state sets of 
extremely fair adversarial safe Büchi games. Subsequently, we generalize this 
intuition to the fixpoint for the Rabin games. 


Fair Adversarial Safe Büchi Games: The following theorem characterizes 
the winning states in an extremely fair adversarial safe Büchi game. 


Theorem 3. Let $G^\ell = \langle G, V^\ell\rangle$ be a game graph with live 
vertices and $Q, G \subseteq V$ be two state sets over $G$. Further, let 


$$Z^* := \nu Y.\ \mu X.\ Q \cap \left[(G \cap \mathrm{Cpre}(Y)) \cup \mathrm{Apre}(Y, X)\right]. \qquad (9)$$


Then $Z^*$ is equivalent to the winning region of Player 0 in the extremely fair 
adversarial game over $G^\ell$ for the winning condition $\psi$ in (7). Moreover, 
the fixpoint algorithm runs in $O(n^2)$ symbolic steps, and a memoryless winning 
strategy for Player 0 can be extracted from it. 
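
A minimal explicit-set sketch of the fixpoint (9) in Python is given below. It 
iterates naively over sets of vertices instead of BDDs, and the function name 
`fair_safe_buchi` as well as the example graph are hypothetical. Passing an empty 
set of live vertices turns Apre into Cpre, so the same function also evaluates 
the "normal" fixpoint (8). 

```python
def fair_safe_buchi(edges, v0, v1, live, safe, goal):
    """Evaluate nu Y. mu X. Q & [(G & Cpre(Y)) | Apre(Y, X)] by iteration."""
    safe, goal = set(safe), set(goal)

    def cpre(s):
        return ({v for v in v0 if edges[v] & s}
                | {v for v in v1 if edges[v] <= s})

    def apre(s, t):
        return cpre(t) | ({v for v in live if edges[v] & t}
                          & {v for v in v1 if edges[v] <= s})

    y = set(edges)                 # greatest fixed point: start from V
    while True:
        x = set()                  # least fixed point: start from the empty set
        while True:
            nxt = safe & ((goal & cpre(y)) | apre(y, x))
            if nxt == x:
                break
            x = nxt
        if x == y:
            return y
        y = x


# Hypothetical example: 0 and 2 are live Player 1 vertices, 1 and 3 belong to
# Player 0, with G = {1} and Q = {0, 1, 2}.
EDGES = {0: {0, 1}, 1: {0}, 2: {1, 3}, 3: {3}}
fair = fair_safe_buchi(EDGES, {1, 3}, {0, 2}, {0, 2}, {0, 1, 2}, {1})
plain = fair_safe_buchi(EDGES, {1, 3}, {0, 2}, set(), {0, 1, 2}, {1})
```

Here `fair` evaluates to `{0, 1}`, while `plain` is empty, since without 
fairness Player 1 can stay in vertex 0 forever and the goal is never revisited. 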


¶ The Rabin pair $\langle G_{p_0}, R_{p_0}\rangle = \langle\emptyset, \emptyset\rangle$ 
in (6) is artificially introduced to make the fixpoint representation more 
compact. It is not part of $\mathcal{R}$. 

Intuitively, the fixpoints in (8) and (9) consist of two parts: (a) A minimal 
fixpoint over $X$ which computes (for any fixed value of $Y$) the set of states 
that can reach the "target state set" $T := Q \cap G \cap \mathrm{Cpre}(Y)$ while 
staying inside the safe set $Q$, and (b) a maximal fixpoint over $Y$ which ensures 
that the only states considered in the target $T$ are those that allow to 
re-visit a state in $T$ while staying in $Q$. 

By comparing (8) and (9) we see that our syntactic transformation only 
changes part (a). Hence, in order to prove Thm. 3 it essentially remains to show 
that this transformation works for the even simpler safe reachability games. 


Extremely Fair Adversarial Safe Reachability Games: A safe reachability 
condition is a tuple $\langle T, Q\rangle$ with $T, Q \subseteq V$, and a play 
$\pi$ satisfies the safe reachability condition $\langle T, Q\rangle$ if $\pi$ 
satisfies the LTL formula 


$$\psi := Q\,\mathcal{U}\,T. \qquad (10)$$


A safe reachability game is often called a reach-while-avoid game, where the 
safe sets are specified by an unsafe set $R := \overline{Q}$ that needs to be 
avoided. Their extremely fair adversarial version is formalized in the following 
theorem and proved in the extended version [4, Thm. 3.3]. 


Theorem 4. Let $G^\ell = \langle G, V^\ell\rangle$ be a game graph with live edges 
and $\langle T, Q\rangle$ be a safe reachability winning condition. Further, let 


$$Z^* := \nu Y.\ \mu X.\ T \cup (Q \cap \mathrm{Apre}(Y, X)). \qquad (11)$$


Then $Z^*$ is equivalent to the winning region of Player 0 in the extremely fair 
adversarial game over $G^\ell$ for the winning condition $\psi$ in (10). Moreover, 
the fixpoint algorithm runs in $O(n^2)$ symbolic steps, and a memoryless winning 
strategy for Player 0 can be extracted from it. 
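
The fixpoint (11) can be evaluated by two nested loops. The Python sketch below 
does this over explicit vertex sets; it is a naive illustration without BDDs or 
warm-starting, and the function name `fair_safe_reach` and the example graph are 
hypothetical. 

```python
def fair_safe_reach(edges, v0, v1, live, target, safe):
    """Evaluate nu Y. mu X. T | (Q & Apre(Y, X)) by naive iteration."""
    target, safe = set(target), set(safe)

    def cpre(s):
        return ({v for v in v0 if edges[v] & s}
                | {v for v in v1 if edges[v] <= s})

    def apre(s, t):
        return cpre(t) | ({v for v in live if edges[v] & t}
                          & {v for v in v1 if edges[v] <= s})

    y = set(edges)                 # outer greatest fixed point starts from V
    while True:
        x = set()                  # inner least fixed point starts from {}
        while True:
            nxt = target | (safe & apre(y, x))
            if nxt == x:
                break
            x = nxt
        if x == y:
            return y
        y = x


# Hypothetical example: 0 and 2 are live Player 1 vertices, 1 and 3 belong to
# Player 0, with T = {1} and Q = {0, 1, 2} (vertex 3 is unsafe).
EDGES = {0: {0, 1}, 1: {1}, 2: {1, 3}, 3: {3}}
fair = fair_safe_reach(EDGES, {1, 3}, {0, 2}, {0, 2}, {1}, {0, 1, 2})
plain = fair_safe_reach(EDGES, {1, 3}, {0, 2}, set(), {1}, {0, 1, 2})
```

In this example `fair` is `{0, 1}`: vertex 0 is won because fairness eventually 
forces the edge to the target, whereas vertex 2 is removed by the outer 
iteration because Player 1 can move to the unsafe vertex 3 on its first visit. 
With no live vertices, `plain` is `{1}`. 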


To gain some intuition on the correctness of Thm. 4, let us recall that the 
fixpoint for safe reachability games without live edges is given by: 


$$\mu X.\ T \cup (Q \cap \mathrm{Cpre}(X)). \qquad (12)$$


Intuitively, the fixpoint computation in (12) is initialized with 
$X^0 = \emptyset$ and computes a sequence $X^0, X^1, \ldots, X^k$ of increasing 
sets until $X^k = X^{k+1}$. We say that $v$ has rank $r$ if 
$v \in X^r \setminus X^{r-1}$. All states contained in $X^r$ allow Player 0 to 
force the play to reach $T$ in at most $r - 1$ steps while staying in $Q$. The 
corresponding Player 0 strategy $\rho_0$ is known to be winning w.r.t. (10) and 
along every play $\pi$ compliant with $\rho_0$, the path $\pi$ remains in $Q$ and 
the rank is always decreasing. 

To see why the same strategy is also sound in the extremely fair adversarial 
safe reachability game $G^\ell$, first recall that for vertices $v \notin V^\ell$ 
of $G^\ell$, the operator $\mathrm{Apre}(Y, X)$ simplifies to $\mathrm{Cpre}(X)$. 
With this, we see that for every $v \notin V^\ell$ a Player 0 winning strategy 
$\tilde{\rho}_0$ in $G^\ell$ can always force plays to stay in $Q$ and to 
decrease their rank, similar to $\rho_0$. Then every play $\pi$ compliant with 
such a strategy $\tilde{\rho}_0$ and visiting a vertex in $V^\ell$ only finitely 
often satisfies (10). 
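
The ranks induced by (12), and a rank-decreasing Player 0 strategy, can be 
computed as in the following Python sketch. It works on explicit vertex sets; 
the function name `reach_ranks`, the example graph, and the tie-breaking rule 
(pick a successor of minimal rank) are assumptions made for illustration. 

```python
def reach_ranks(edges, v0, v1, target, safe):
    """Iterate mu X. T | (Q & Cpre(X)) and record the rank of each vertex."""
    target, safe = set(target), set(safe)

    def cpre(s):
        return ({v for v in v0 if edges[v] & s}
                | {v for v in v1 if edges[v] <= s})

    rank, x, r = {}, set(), 0
    while True:
        nxt = target | (safe & cpre(x))
        if nxt == x:
            break
        r += 1
        for v in nxt - x:          # v is in X^r but not in X^(r-1)
            rank[v] = r
        x = nxt

    # A Player 0 vertex of rank r > 1 has a successor of smaller rank;
    # moving to a successor of minimal rank decreases the rank in every step.
    strategy = {v: min((w for w in edges[v] if w in rank), key=rank.get)
                for v in v0 if rank.get(v, 0) > 1}
    return rank, strategy


# Hypothetical example: 0, 2, 4 are Player 0 vertices and 1, 3 are Player 1
# vertices; T = {2} and vertex 4 is unsafe.
EDGES = {0: {1}, 1: {2}, 2: {2}, 3: {2, 4}, 4: {4}}
ranks, strategy = reach_ranks(EDGES, {0, 2, 4}, {1, 3}, {2}, {0, 1, 2, 3})
```

Here the target vertex 2 gets rank 1, vertex 1 gets rank 2, and vertex 0 gets 
rank 3, so that a vertex of rank $r$ reaches $T$ within $r - 1$ steps. Vertex 3 
receives no rank because Player 1 can move from it to the unsafe vertex 4. 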




Fig. 1. Fair adversarial game graph discussed in Ex. 1 and Ex. 2 with Player 0 and 
Player 1 vertices being indicated by circles and squares, respectively. The live 
vertices are $V^\ell = \{2,3,5\}$ (double square, blue), the target vertices are 
$G = \{6,9\}$ (double circle, green), and the unsafe vertices are 
$\overline{Q} = \{1\}$ (red, dotted). 


The only interesting case for soundness of Thm. 4 is therefore every play $\pi$ 
that visits states in $V^\ell$ infinitely often. However, as the number of 
vertices is finite, we only have a finite number of ranks and hence a certain 
vertex $v \in V^\ell$ with a finite rank $r$ needs to get visited by $\pi$ 
infinitely often. From the definition of Apre, we know that a state 
$v \in V^\ell$ is contained in $X^r$ only if $v$ has an outgoing edge reaching 
$X^k$ with $k < r$. Because of the extreme fairness condition, reaching $v$ 
infinitely often implies that also a state with rank $k$ s.t. $k < r$ will get 
visited infinitely often. As $X^1 = T$ we can show by induction that $T$ is 
eventually visited along $\pi$ while $\pi$ always remains in $Q$ until then. 

In order to prove completeness of Thm. 4 we need to show that all states 
in $V \setminus Z^*$ are losing for Player 0. Here, again the reasoning is 
equivalent to the "normal" safe reachability game for $v \notin V^\ell$. For live 
vertices $v \in V^\ell$, we see that $v$ is not added to $Z^*$ via Apre if 
$v \notin T$ and either (i) none of its outgoing edges make progress towards $T$ 
or (ii) some of its outgoing edges leave $Z^*$. One can therefore construct a 
Player 1 strategy that for (i)-vertices always chooses an arbitrary transition 
and thereby never makes progress towards $T$ (also if $v$ is visited infinitely 
often), and for (ii)-vertices ensures that they are only visited once on plays 
which remain in $Q$. This ensures that (ii)-vertices never make progress towards 
$T$ via their possibly existing rank-decreasing edges. 

In the extended version [4], we have provided a detailed soundness and com- 
pleteness proof of Thm. 4 along with the respective Player 0 and Player 1 strat- 
egy construction. In addition, there we also proved Thm. 3 using a reduction to 
Thm. 4 for every iteration over Y. 


Example 1 (Extremely fair adversarial safe reachability game). We consider an 
extremely fair adversarial safe reachability game over the game graph depicted 
in Fig. 1 with target vertex set $T = G = \{6,9\}$ and safe vertex set 
$Q = V \setminus \{1\}$. 

We denote by $Y^m$ the $m$-th iteration over the fixpoint variable $Y$ in (11), 
where $Y^0 = V$. Further, we denote by $X^{mi}$ the set computed in the $i$-th 
iteration over the fixpoint variable $X$ in (11) during the computation of $Y^m$, 
where $X^{m0} = \emptyset$. We further have $X^{11} = T = \{6,9\}$ as 
$\mathrm{Apre}(\cdot, \emptyset) = \emptyset$. Now we compute 


$$X^{12} = T \cup (Q \cap \mathrm{Apre}(Y^0, X^{11})) = \{6,9\} \cup \Big(V \setminus \{1\} \cap \big[\underbrace{\mathrm{Cpre}(X^{11})}_{\{7,8\}} \cup \underbrace{(\mathrm{Lpre}^{\exists}(X^{11}) \cap \mathrm{Pre}^{\forall}_1(V))}_{\{3,5\}}\big]\Big) = \{3,5,6,7,8,9\}. \qquad (13)$$


We observe that the only vertices added to $X$ via the Cpre term are 7 and 
8. The live vertices 3 and 5 are added due to their outgoing edges leading to 
the target vertex 6. The additional requirement $\mathrm{Pre}^{\forall}_1(V)$ in 
$\mathrm{Apre}(Y^0, X^{11})$ is trivially satisfied for all vertices at this point 
as $Y^0 = V$ and can therefore be ignored. Doing one more iteration over $X$ we 
see that now vertex 4 gets added via the Cpre term (as it is a Player 0 vertex 
that allows progress towards 5) and vertex 2 is added via the Apre term (as it is 
live and allows progress to 3). The iteration over $X$ terminates with 
$Y^1 = X^{1*} = V \setminus \{1\}$. 

Re-iterating over $X$ for $Y^1$ gives $X^{22} = X^{12} = \{3,5,6,7,8,9\}$ as 
before. However, now vertex 2 does not get added to $X^{23}$ because vertex 2 has 
an edge leading to $V \setminus Y^1 = \{1\}$. Therefore the iteration over $X$ 
terminates with $Y^2 = X^{2*} = V \setminus \{1,2\}$. When we now re-iterate over 
$X$ for $Y^2$ we see that vertex 3 is not added to $X^{32}$ any more, as vertex 3 
has a transition to $V \setminus Y^2 = \{1,2\}$. Therefore the iteration over $X$ 
now terminates with $Y^3 = X^{3*} = V \setminus \{1,2,3\}$. Now re-iterating over 
$X$ does not change the vertex set anymore and the fixed-point computation 
terminates with $Y^* = Y^3 = V \setminus \{1,2,3\}$. 

We note that the fixpoint expression (12) for "normal" safe reachability 
games terminates after two iterations over $X$ with $X^* = \{6,7,8,9\}$, as 
vertices 7 and 8 are the only vertices added via the Cpre operator in (13). Due 
to the stricter notion of Cpre requiring that all outgoing edges of Player 1 
vertices make progress towards the target, (12) does not require an outer largest 
fixed-point over $Y$ to "trap" the play in a set of vertices which allow progress 
when "waiting long enough". This "trapping" required in (11) via the outer 
fixpoint over $Y$ actually fails for vertices 2 and 3 (as they are excluded from 
the winning set of (11)). Here, Player 1 can force an "escape" to the unsafe 
vertex 1 in two steps before 2 and 3 are visited infinitely often (which would 
imply progress towards 6 via the existing live edges). 

We see that the winning region in the "normal" game is much smaller than the 
winning region for the extremely fair adversarial game, as adding live transitions 
restricts the strategy choices of Player 1, making it easier for Player 0 to win. 


Example 2 (Extremely fair adversarial safe Büchi game). We now consider an 
extremely fair adversarial safe Büchi game over the game graph depicted in Fig. 1 
with target set $G = \{6,9\}$ and safe set $Q = V \setminus \{1\}$. 

We first observe that we can rewrite the fixpoint in (9) as 


$$\nu Y.\ \mu X.\ [Q \cap G \cap \mathrm{Cpre}(Y)] \cup [Q \cap \mathrm{Apre}(Y, X)]. \qquad (14)$$


Using (14) we see that for $Y^0 = V$ we can define 
$T^0 := Q \cap G \cap \mathrm{Cpre}(V) = G = \{6,9\}$. Therefore the first 
iteration over $X$ is equivalent to (13) and terminates with 
$Y^1 = X^{1*} = V \setminus \{1\}$. 

Now, however, we need to re-compute $T$ for the next iteration over $X$ and 
obtain 
$T^1 = Q \cap G \cap \mathrm{Cpre}(Y^1) = V \setminus \{1\} \cap \{6,9\} \cap V \setminus \{1,2,9\} = \{6\}$. 
This re-computation of $T^1$ checks which target vertices are repeatedly 
reachable, as required by the Büchi condition. As vertex 9 has no outgoing edge, 
it trivially cannot be reached repeatedly. 

With this, we see that for the next iteration over $X$ we only have one target 
vertex $T^1 = \{6\}$. Unlike the safe reachability case in Ex. 1, the vertex 7 
cannot be added to $X^{22}$, since Player 1 can always decide to take the edge 
towards 9 from 7, and therefore prevents a repeated visit of a target state. 
Vertices 2 and 3 get eliminated for the same reason as in the safe reachability 
game within the second and third iteration over $Y$. The overall fixpoint 
computation therefore terminates with $Y^* = Y^3 = \{4,5,6,8\}$. 


Proof of Thm. 2: The proof of Thm. 2 essentially follows from the same 
arguments as in the soundness proof of the Rabin fixpoint for 2-player games by 
Piterman et al. [27], which utilizes Thm. 4 and Thm. 3 at all suitable places. In 
[4, App. A, pp. 29], we illustrate the steps of the Rabin fixpoint in (6) using a 
simple extremely fair adversarial Rabin game with two Rabin pairs. 


Remark 1. We remark that the fixpoint (11), as well as the Apre operator, are 
similar in structure to the solution of almost surely winning states in concurrent 
reachability games [1]. In concurrent games, the fixpoint captures the largest 
set of states in which the game can be trapped while maintaining a positive 
probability of reaching the target. In our case, the fixpoint captures the largest 
set of states in which Player 0 can keep the game while ensuring a visit to the 
target either directly or through some of the edges from the live vertices. The 
commonality justifies our notation and terminology for Apre. 


Remark 2. [2] studied fair CTL and LTL model checking where the fairness 
condition is given by extreme fairness with all vertices of the transition system 
being live. They show that CTL model checking under this all-live fairness 
condition can be syntactically transformed to non-fair CTL model checking. A 
similar transformation is possible for fair model checking of Büchi, Rabin, and 
Streett formulas. The correctness of their transformation is based on reasoning 
similar to our Apre operator. For example, a state satisfies the CTL formula 
$\forall\Diamond p$ under fairness iff all paths starting from the state either 
eventually visit $p$ or always visit states from which a visit to $p$ is 
possible. 


Complexity Analysis of (6): For Rabin games with $k$ Rabin pairs, Piterman et 
al. [27] proposed a fixpoint formula with alternation depth $2k + 1$. Using the 
accelerated fixpoint computation technique of Long et al. [23], they deduce a 
bound of $O(n^{k+1}k!)$ symbolic steps. We can apply the same acceleration 
technique to our fixpoint (6), yielding a complexity upper bound of 
$O(n^{k+2}k!)$ symbolic steps. (The additional complexity is because of an 
additional outermost $\nu$-fixpoint.) 


6 Experimental Evaluation 


We developed a C++-based tool Fairsyn,†† which implements the symbolic fair 
adversarial Rabin fixpoint from Eq. (6) using Binary Decision Diagrams (BDDs). 
Fairsyn has a single-threaded and a multi-threaded version, which respectively 
use the CUDD BDD library [32] and the Sylvan BDD library [11]. In both, we 
used a fixpoint acceleration procedure that "warm-starts" the inner fixpoints by 
exploiting a monotonicity property (detailed in the extended version [4]). 


†† Repository URL: https://gitlab.mpi-sws.org/kmallik/synthesis-with-edge-fairness 

We demonstrate the effectiveness of our proposed symbolic algorithm for 
2½-player Rabin games using a set of synthetic benchmark experiments derived from 
the VLTS benchmark suite (Sec. 6.1) and a controller synthesis experiment for 
a stochastic dynamical system (Sec. 6.2); in the extended version [4], we include 
an additional software engineering benchmark example from the literature. In 
all of these examples, Fairsyn significantly outperformed the state-of-the-art. 

The experiments in Sec. 6.1 were performed using the multi-threaded Fairsyn 
on a computer equipped with a 3 GHz Intel Xeon E7 v2 processor with 48 CPU 
cores and 1.5 TiB RAM. The experiments in Sec. 6.2 were performed using the 
single-threaded Fairsyn on a Macbook Pro (2015) laptop equipped with a 2.7 GHz 
Dual-Core Intel Core i5 processor with 16 GiB RAM. 


6.1 The VLTS Benchmark Experiments 


We present a collection of synthetic benchmarks for empirical evaluation of the
merits of our direct symbolic algorithm compared to the one using the reduction
to 2-player games [7]; in the following, we refer to the latter as the indirect
approach. Like our direct algorithm, the indirect approach has been implemented
in Fairsyn and benefits from the same Sylvan-based parallel BDD library and
accelerated fixpoint solution technique. We collected the first 20 transition
systems from the Very Large Transition Systems (VLTS) benchmark suite [15];
their descriptions can be found on the VLTS benchmark website. For each of
them, we randomly generated instances of 2½-player Rabin games with up to 3
Rabin pairs using the following procedure: (i) we labeled a given fraction of
the vertices as random vertices, (ii) we equally partitioned the remaining
vertices into system and environment vertices, and (iii) for every set in
R = {(G_1, R_1), ..., (G_k, R_k)}, we randomly selected up to 5% of all vertices
to be contained in the set. All the vertices in (i), (ii), and (iii) were
selected randomly. In these examples, the number of vertices ranged from 289 to
164,865, the number of BDD variables ranged from 9 to 18, and the number of
transitions ranged from 1,224 to 2,621,480.
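The three-step labeling procedure above can be sketched as follows. This is only an illustrative reconstruction, not the generator used for the experiments: the function name, the rounding of the random-vertex fraction, the uniform choice of each set size between 0 and 5% of the vertices, and the seed are all assumptions.

```python
import random

def make_rabin_game_instance(vertices, random_fraction, num_pairs,
                             max_set_fraction=0.05, seed=0):
    """Randomly label the vertices of a transition system as a 2.5-player Rabin game."""
    rng = random.Random(seed)
    shuffled = list(vertices)
    rng.shuffle(shuffled)

    # (i) label the requested fraction of the vertices as random vertices
    n_random = round(random_fraction * len(shuffled))
    random_vertices = set(shuffled[:n_random])

    # (ii) split the remaining vertices evenly into system and environment vertices
    rest = shuffled[n_random:]
    half = len(rest) // 2
    system_vertices = set(rest[:half])
    environment_vertices = set(rest[half:])

    # (iii) every set G_i and R_i receives up to max_set_fraction of all vertices
    limit = int(max_set_fraction * len(shuffled))

    def random_subset():
        size = rng.randint(0, limit)  # assumed: set size drawn uniformly
        return set(rng.sample(shuffled, size))

    rabin_pairs = [(random_subset(), random_subset()) for _ in range(num_pairs)]
    return random_vertices, system_vertices, environment_vertices, rabin_pairs
```

Setting `random_fraction` to 0 yields a 2-player game and setting it to 1 yields a Markov chain, which matches the two extremes discussed for the right plot of Fig. 2.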

In Fig. 2, we compare the running times of Fairsyn and the indirect approach. 
On the left scatter plot, every point corresponds to one instance of the randomly 
generated benchmarks, where the X and the Y coordinates represent the run- 
ning time for Fairsyn and the indirect approach respectively. The solid red line 
indicates the exact same performance for both methods, whereas the dashed 
red line indicates an order of magnitude performance improvement for Fairsyn 
compared to the indirect approach. Observe that Fairsyn was faster by up to 
two orders of magnitude for the majority of the cases. In the experiments, the 
memory footprint of Fairsyn and the indirect approach was similar. 

In the right plot, the X-axis corresponds to the proportion of random vertices
within the set of vertices, in percent: 0% corresponds to a 2-player game and
100% corresponds to a Markov chain. The Y-axis corresponds to the running
time normalized with respect to the running time for the 0% case. We observe
that Fairsyn was insensitive to changes in the proportion of random vertices.
On the other hand, the indirect approach took longer for larger proportions
of random vertices, because for every random vertex it adds 3k + 2 additional
vertices, thus causing a linear blowup in the size of the game graph. The big
variations in the time differences of the two approaches are due to the varying
size of the experiments: the larger a game graph is, the larger the difference.
Interestingly, for both Fairsyn and the indirect method, there is a dip in the
running time when all the vertices are random (i.e., the 100% case), which is
possibly due to faster computation of the Cpre and Apre operators and faster
convergence of the fixpoint algorithm, owing to the absence of Player 0 and
Player 1 vertices.
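To make the linear blowup concrete, the following small sketch computes the vertex count of the reduced game. It assumes that the reduction adds exactly 3k + 2 vertices per random vertex and nothing else; the function names are ours.

```python
def indirect_game_size(num_vertices, num_random, num_rabin_pairs):
    """Vertices of the reduced 2-player game, assuming 3k + 2 extra vertices
    per random vertex (k = number of Rabin pairs)."""
    return num_vertices + num_random * (3 * num_rabin_pairs + 2)

def blowup_factor(random_fraction, num_rabin_pairs):
    """Size of the reduced game relative to the original game graph."""
    return 1 + random_fraction * (3 * num_rabin_pairs + 2)
```

For instance, with k = 3 and half of the vertices random, the reduced game has 6.5 times as many vertices as the original one, whereas the direct algorithm works on the original graph.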


[Figure 2 plots. Left: the indirect approach (s) against Fairsyn (s). Right: normalized running time against fraction of random vertices (%), from 0 to 100.]


Fig. 2. LEFT: Comparison of running time of Fairsyn and the indirect approach on 
the VLTS benchmarks. All axes are in log-scale. RIGHT: Sensitivity of normalized 
running time w.r.t. variation of the proportion of random vertices. The blue and the red 
lines correspond to different instances of Fairsyn and the indirect approach respectively. 


6.2 Synthesis for Stochastically Perturbed Dynamical Systems 


Synthesizing verified symbolic controllers for continuous dynamical systems is an
active area in cyber-physical systems research [33]. We consider a stochastically
perturbed dynamical system model, called the bistable switch [12], which is an
important model studied in molecular biology. The system model, call it Σ, has a
continuous and compact two-dimensional state space X = [0, 4] × [0, 4] ⊂ R² and
a finite input space U = {−0.5, 0, 0.5} × {−0.5, 0, 0.5}. Suppose for any given
time k ∈ N, x_1(k), x_2(k) are the two states, u_1(k), u_2(k) are the two inputs,
and w_1(k), w_2(k) are a pair of statistically independent noise samples drawn
from a pair of distributions with bounded supports W_1 = [−0.4, −0.2] and
W_2 = [−0.4, −0.2], respectively. Then the states of Σ at the next time instant are:
x_1(k+1) = x_1(k) + 0.05 (−1.3 x_1(k) + x_2(k)) + u_1(k) + w_1(k),
x_2(k+1) = x_2(k) + 0.05 (x_1(k)² / (x_1(k)² + 1) − 0.25 x_2(k)) + u_2(k) + w_2(k).     (15)
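A path of the system can be sampled with a few lines of code. The sketch below rests on several assumptions: the drift of the second state is taken to be the usual bistable-switch term x_1² / (x_1² + 1) − 0.25 x_2, the noise is sampled uniformly from its support (the text only states that the supports are bounded), and the state is not clipped to the state space. The function names are ours.

```python
import random

# The nine admissible inputs U = {-0.5, 0, 0.5} x {-0.5, 0, 0.5}.
INPUTS = [(u1, u2) for u1 in (-0.5, 0.0, 0.5) for u2 in (-0.5, 0.0, 0.5)]

def step(x, u, w):
    """One step of the perturbed bistable switch (assumed form of Eq. (15))."""
    x1, x2 = x
    nx1 = x1 + 0.05 * (-1.3 * x1 + x2) + u[0] + w[0]
    nx2 = x2 + 0.05 * (x1**2 / (x1**2 + 1) - 0.25 * x2) + u[1] + w[1]
    return nx1, nx2

def simulate(controller, x0, steps, seed=0):
    """Sample a finite path prefix under a state-feedback controller."""
    rng = random.Random(seed)
    path = [x0]
    for _ in range(steps):
        x = path[-1]
        # Assumption: uniform noise on the supports W_1 = W_2 = [-0.4, -0.2].
        w = (rng.uniform(-0.4, -0.2), rng.uniform(-0.4, -0.2))
        path.append(step(x, controller(x), w))
    return path
```

A constant controller such as `lambda x: (0.5, 0.5)` is enough to try it out; a synthesized controller would instead map each abstract cell of the state space to one of the nine inputs.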


A controller C for the system is a function C: X → U mapping the state x(k) at
any time instant k to a suitable control input u(k). Then applying (15) repeatedly
with u(k) = C(x(k)), starting with an initial state (x_1(0), x_2(0)) = x(0) = x_init,
gives us an infinite sequence of states (x(0), x(1), x(2), ...) called a path. For
a fixed controller C and for a given initial state x_init, we obtain a probability
measure P^C_{x_init} on the sample space of paths of the system, in a way similar
to how we obtained the probability measure over infinite plays of 2½-player games.

Table 1. Performance comparison between Fairsyn and StochasticSynthesis (abbreviated
as SS) [12] on a comparable implementation of the abstraction (uniform grid-based
abstraction). Col. 1 shows the size of the resulting 2½-player game graph (computed
using the algorithm given in [24]), Col. 2 and 3 compare the total synthesis times and
Col. 4 and 5 compare the peak memory footprint (as measured using the "time"
command) for Fairsyn and SS respectively. "OoM" stands for out-of-memory.

# vertices in            Total synthesis time       Peak memory footprint
2½-game abstraction      Fairsyn       SS           Fairsyn      SS
3.8 x 10?                0.4 s         30 s         66 MiB       156 MiB
2.2 x 10*                8.2 s         55 s         72 MiB       1 GiB
1.1 x 10°                1 min 23 s    16 min 1 s   108 MiB      81 GiB
6.6 x 10?                5 min 27 s    OoM          166 MiB      126 GiB
4.3 x 10°                41 min 7 s    OoM          517 MiB      127 GiB

Let φ ⊆ X^ω be a Rabin specification, defined using a finite predicate over X.
We extend the notion of almost sure winning for control systems in the obvious
way: a state x ∈ X of the system is almost sure winning if there is a controller
C such that P^C_x(φ) = 1. The controller synthesis problem asks to compute an
optimal controller C* such that for every almost sure winning state x,
P^{C*}_x(φ) = 1.

Majumdar et al. [24] show that this synthesis problem can be approximately
solved by lifting the system to a finite 2½-player game. We used Fairsyn to
solve the resulting 2½-player Rabin games obtained for the controller synthesis
problem for the system in (15) and for the following specification given in LTL
using the predicates A, B, C, D as shown in Fig. 3:
φ := (□◇B → ◇C) ∧ (◇A → □¬C).

In Table 1, we compare the performance of Fairsyn against the state-of-the- 
art algorithm for solving this problem, which is implemented in the tool called 
StochasticSynthesis (SS) [12]. It can be observed that Fairsyn significantly out- 
performs SS for every abstraction of different coarseness considered here. 


Fig. 3. Predicates over X.

Abstract. In 2021, Casares, Colcombet, and Fijalkow introduced the
Alternating Cycle Decomposition (ACD) to study properties and transformations
of Muller automata. We present the first practical implementation of the ACD
in two different tools, Owl and Spot, and adapt it to the framework of
Emerson-Lei automata, i.e., ω-automata whose acceptance conditions are defined
by Boolean formulas. The ACD provides a transformation of Emerson-Lei automata
into parity automata with strong optimality guarantees: the resulting parity
automaton is minimal among those automata that can be obtained by duplication
of states. Our empirical results show that this transformation is usable in
practice. Further, we show how the ACD can generalize many other specialized
constructions such as deciding typeness of automata and degeneralization of
generalized Büchi automata, providing a framework of practical algorithms for
ω-automata.


1 Introduction 


Automata over infinite words have many applications, including verification and
synthesis of reactive systems with specifications given in formalisms such as
Linear Temporal Logic (LTL) [27, 23, 11, 12, 2, 29]. The synthesis problem from
LTL specifications asks, given an LTL formula φ, to build a controller that
processes an input word letter by letter, producing an output word, such that
the combined input-output word satisfies φ. The automata-theoretic approach to
this problem (first introduced by Pnueli and Rosner [27]) consists of building a
deterministic ω-automaton A equivalent to the LTL specification φ, then
constructing a game from A in which the opponent chooses the input letters for
the automaton, and finally solving this game and obtaining a controller from a
winning strategy (whenever such a strategy exists). The automaton A can use
different kinds of acceptance conditions (Rabin, Emerson-Lei, Muller, parity...)
and thus we obtain games with different winning conditions. Among these games,
parity games are the easiest to solve and there are highly developed techniques
for parity game solvers. Thus it is common practice to transform the automaton
A into a parity one (for which we might need to augment the state space
of the automaton). The top-ranked tools in the SyntComp competitions [17],
Strix [23] (winner of the 2018, 2019, 2020 and 2021 editions) and ltlsynt [26],
use this approach, producing a transition-based Emerson-Lei automaton (TELA) as
an intermediate step before constructing the parity automaton. For this reason,
optimal and efficient procedures to transform Emerson-Lei automata into parity
automata are of great importance.


Emerson-Lei (EL) acceptance conditions (first defined by Emerson and Lei
[10], and reinvented in the HOA format [3]) are arbitrary positive Boolean
formulas over the primitives Inf(c) and Fin(c), where c ranges over colors from
a set Γ. A run is accepting if the set of colors F ⊆ Γ seen infinitely often is
a satisfying assignment to the EL acceptance condition (see Section 2 for a
formal definition). Note that an explicit representation of all satisfying
assignments is comparable to the Muller condition [15, Section 1.3.2]. Since
the Boolean structure of LTL formulas can be mimicked by the Emerson-Lei
acceptance conditions, a translation of LTL formulas to Emerson-Lei automata is
particularly convenient.


Many algorithms to transform Emerson-Lei and Muller automata to parity
have been proposed. In essence they all transform an automaton by turning
each original state q into multiple states of the form (q, r) where r records some
information about the current run, and transitions leaving (q, r) otherwise have a
one-to-one mapping with those leaving q. Definition 3 calls this a locally bijective
morphism, and we like to refer to those as algorithms that duplicate states. For
instance in the Later Appearance Record (LAR) [16], r is a list of all colors
ordered by most recent appearance, therefore producing a blow-up of |Γ|! in the
state space of the automaton. The State Appearance Record (SAR) [24, 22] is a
variation of this idea for state-based conditions, and the Color Appearance Record
(CAR) [28] is a variation for the Emerson-Lei condition. The Index Appearance
Record (IAR) [24, 22, 20] is a specialized construction for Rabin and Streett
conditions, where r is now an ordering of pair indices. These algorithms have
no particular insight into the input acceptance condition, such as inclusions or
redundancies between colors (or pairs). In the Zielonka-tree transformation [31],
r is a reference to a branch in a tree representation of a Muller condition. That
tree representation is tailored to the condition and allows such simplifications
compared to previous methods (it can be proven to be always better [6, 25]).
While none of these algorithms use the structure of the input automaton to
optimize the produced automata, some heuristics have been proposed [28, 25, 21].


In 2021, inspired by the Zielonka tree, Casares et al. introduced the
Alternating Cycle Decomposition (ACD) of a Muller automaton [6]. Simply put, the
ACD is a forest, i.e., a list of trees, that captures how accepting and rejecting
cycles interleave in the automaton. They use the ACD to transform Muller automata
into parity automata, and they prove a strong optimality result: the resulting
automaton uses an optimal number of colors and has a minimal number of states
among those parity automata that can be obtained by duplicating states of the
original one (see Theorem 1 for a formal statement). The main novelty of this
transformation is that it does not only take into account the structure of both
the acceptance condition and the automaton, but it exactly captures how they
interact with each other. Moreover, Casares et al. [6] show that we can obtain
some other valuable information about a Muller automaton from its ACD: for
example the ACD can be used to decide typeness, i.e., if we can relabel it with
another acceptance condition (parity, Rabin, Streett...). Their approach is
primarily theoretical and puts the emphasis on how the ACD can be useful to
obtain new results concerning Muller automata, but little is said about the costs
of computing the ACD or the applicability of the transformation in practice.


Contributions. In this paper, we show that the ACD is practical. We adapt the 
definition of the ACD to Emerson-Lei automata and the HOA format [3]. We 
implement the ACD and the associated transformation in two tools: Owl [18] 
and Spot [9], providing baselines for efficient implementations of these struc- 
tures. We show that the ACD gives a usable and useful method to transform 
Emerson-Lei automata into parity ones, improving upon any previous transfor- 
mation in terms of the size of the output parity automaton. We extend the ACD 
to produce state-based automata, and show that the ACD generally beats tradi- 
tional degeneralization-based procedures. Our implementation can also use the 
ACD to check typeness of deterministic automata. 


Structure of the paper. We begin by providing some common definitions in
Section 2. In Section 3, we define the Alternating Cycle Decomposition, adapting
the definition of Casares et al. [6] to Emerson-Lei automata, and in Section 4
we provide an algorithm to compute it. In Section 5, we study the transformation
of Emerson-Lei automata into parity ones using the ACD and we show experimental
results obtained by comparing the ACD transform implemented in Spot and Owl with
other commonly used transformations. In Section 6 we show experimental results
in the particular case of degeneralization of generalized Büchi automata.
In Section 7 we discuss the utility of the ACD to decide typeness of automata.


2 Preliminaries 


We denote by |A| the cardinality of a set A and by 2^A its power set. For a
finite alphabet Σ, we write Σ* and Σ^ω for the sets of finite and infinite words,
respectively, over Σ. The empty word is denoted by ε. Given v ∈ Σ*, w ∈ Σ^ω,
we denote their concatenation by v · w and we write v ⊑ w if v is a prefix of w.
We denote by inf(w) the set of letters that occur infinitely often in w. Given a
map σ: A → B and a subset A' ⊆ A, we denote by σ|_{A'} the restriction of σ to
A'. We extend σ to A* and A^ω component-wise and we denote these extensions by σ
whenever no confusion arises.

A (directed, edge-colored) graph is a pair G = (V, E) where V is a finite set
of vertices and E ⊆ V × Γ × V is a finite set of Γ-colored edges. Note that with
this definition one can have multiple differently colored edges from a vertex v to
a vertex u. A graph G' = (V', E') is a subgraph of G (written G' ⊆ G) if V' ⊆ V
and E' ⊆ E. A graph G = (V, E) is strongly connected if for every pair of vertices
(v, u) ∈ V² there is a path from v to u. A strongly connected component (SCC)
of a graph G is a maximal strongly connected subgraph of G.

Table 1: Encoding of common acceptance conditions into Emerson-Lei conditions.
The variables c, c_0, c_1, ... stand for arbitrary colors from the set Γ.

(B)   Büchi                  Inf(c)
(GB)  generalized Büchi      ⋀_i Inf(c_i)
(C)   co-Büchi               Fin(c)
(GC)  generalized co-Büchi   ⋁_i Fin(c_i)
(R)   Rabin                  ⋁_i (Fin(c_{2i}) ∧ Inf(c_{2i+1}))
(S)   Streett                ⋀_i (Inf(c_{2i}) ∨ Fin(c_{2i+1}))
(P)   parity min even        Inf(0) ∨ (Fin(1) ∧ (Inf(2) ∨ (Fin(3) ∧ ...)))
      parity min odd         Fin(0) ∧ (Inf(1) ∨ (Fin(2) ∧ (Inf(3) ∨ ...)))


Emerson-Lei acceptance conditions. Let Γ = {0, ..., n − 1} be a finite set of n
integers called colors, from now on also written Γ = {⓪, ①, ...} in our examples.
We define the set EL(Γ) of acceptance conditions according to the following
grammar, where c stands for any color in Γ:

α ::= ⊤ | ⊥ | Inf(c) | Fin(c) | (α ∧ α) | (α ∨ α)

Acceptance conditions are interpreted over subsets of Γ. For C ⊆ Γ we define
the satisfaction relation C ⊨ α inductively according to the following semantics:

C ⊨ ⊤      C ⊨ Inf(c) iff c ∈ C      C ⊨ α_1 ∧ α_2 iff C ⊨ α_1 and C ⊨ α_2
C ⊭ ⊥      C ⊨ Fin(c) iff c ∉ C      C ⊨ α_1 ∨ α_2 iff C ⊨ α_1 or C ⊨ α_2

We denote by ¬α the negation of the acceptance condition α, i.e., Fin(m)
becomes Inf(m), and vice-versa, ∧ becomes ∨, etc. We assume that constants are
propagated, i.e., a formula is either ⊤, ⊥, or does not contain ⊤ and ⊥.
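The satisfaction relation and the negation can be written down directly as a recursive evaluator. The sketch below is not taken from Owl or Spot: it assumes a representation of conditions as nested tuples, with tags chosen by us.

```python
def satisfies(colors, cond):
    """Return True iff the color set `colors` satisfies the EL condition `cond`."""
    tag = cond[0]
    if tag == "true":
        return True
    if tag == "false":
        return False
    if tag == "inf":
        return cond[1] in colors
    if tag == "fin":
        return cond[1] not in colors
    if tag == "and":
        return satisfies(colors, cond[1]) and satisfies(colors, cond[2])
    if tag == "or":
        return satisfies(colors, cond[1]) or satisfies(colors, cond[2])
    raise ValueError(f"unknown connective: {tag!r}")

def negate(cond):
    """Dualize a condition: Inf <-> Fin, and <-> or, true <-> false."""
    tag = cond[0]
    dual = {"true": "false", "false": "true", "inf": "fin",
            "fin": "inf", "and": "or", "or": "and"}[tag]
    if tag in ("and", "or"):
        return (dual, negate(cond[1]), negate(cond[2]))
    return (dual,) + cond[1:]

# A Rabin condition with two pairs: (Fin(0) and Inf(1)) or (Fin(2) and Inf(3)).
rabin = ("or", ("and", ("fin", 0), ("inf", 1)),
               ("and", ("fin", 2), ("inf", 3)))
```

With this encoding, `satisfies({1}, rabin)` holds while `satisfies({0, 1}, rabin)` does not, and `negate(rabin)` is the corresponding Streett condition.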

Table 1 shows how common acceptance conditions can be encoded into
Emerson-Lei conditions. Note that colors may appear multiple times; for
instance, a disjunction of two Fin/Inf pairs in which the same color occurs in
both pairs is still a Rabin condition.


Emerson-Lei automata. A transition-based Emerson-Lei automaton (TELA) is
a tuple A = (Q, Σ, Q_0, Δ, Γ, α), where Q is a finite set of states, Σ is a finite
input alphabet, Q_0 ⊆ Q is a non-empty set of initial states, Γ is a set of colors,
Δ ⊆ Q × Σ × 2^Γ × Q is a finite set of transitions, and α ∈ EL(Γ) is an
Emerson-Lei condition. The graph of A is the directed edge-colored graph
G_A = (Q, E) where the edges E = {(q, C, q') : ∃a ∈ Σ. (q, a, C, q') ∈ Δ} are
obtained from Δ by removing Σ. We denote the transition (q, a, C, q') ∈ Δ and the
edge (q, C, q') ∈ E by q --a:C--> q' and q --C--> q', respectively. Further, we
might omit a or C if they are clear from the context. We denote by γ the
projection of Δ or E to the set of colors Γ. Given a word w = a_0 a_1 a_2 ⋯ ∈ Σ^ω,
a run over w in A is a sequence ρ = (q_0, a_0, C_0, q_1)(q_1, a_1, C_1, q_2) ⋯ ∈ Δ^ω
such that q_0 ∈ Q_0. The output of the run ρ is the word γ(ρ) ∈ (2^Γ)^ω. A run ρ
is accepting if inf(γ(ρ)) ⊨ α. A word w ∈ Σ^ω is accepted (or recognized) by A if
there exists an accepting run over w in A. We denote by L(A) the set of words
accepted by A. Two automata A, A' are equivalent if L(A) = L(A'). The size of an
automaton, written |A|, is the cardinality of its set of states. A state q ∈ Q is
reachable if there is a path from some state in Q_0 to q in G_A.

An automaton A is deterministic if Q_0 is a singleton and for every q ∈ Q
and a ∈ Σ there is at most one transition from q labeled with a, q --a:C--> q' ∈ Δ.

We will use automata with acceptance defined over transitions (instead of
state-based acceptance) by default. However, in Sections 5 and 6 we will also
discuss transformations towards automata with state-based acceptance.

If the acceptance condition of an automaton is represented as a condition of
kind X (cf. Table 1), we call it an X-automaton. We assume that each transition
of a parity automaton is colored with exactly one color; this can be achieved by
substituting the set C in a transition q --a:C--> q' by min C (if C ≠ ∅) or by
|Γ| + 1 if C = ∅. (If C is a singleton we will omit the brackets in the notation.)


Labeled trees. A tree is a non-empty prefix-closed set T ⊆ N* whose elements
are called nodes. It is partially ordered by the prefix relation; if x ⊑ y we say
that x is an ancestor of y and y is a descendant of x (we add the adjective
"strict" if moreover x ≠ y). The empty string ε is the root of the tree. The set
of children of a node x ∈ T is Children_T(x) = {x·i ∈ T : i ∈ N}. The set of
leaves of T is Leaves(T) = {x ∈ T : Children_T(x) = ∅}. Nodes belonging to the
same set Children_T(x) are called siblings, and they are ordered from left to right
by increasing value of their last component. If A is a set of labels, an A-labeled
tree is a pair (T, η) of a tree T and a map η: T → A. The depth of a node x is
Depth(x) = |x|. The height of T is Height(T) = max_{x ∈ T} Depth(x).
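These definitions translate almost literally into code if a node is represented as a tuple of natural numbers and a tree as a set of such tuples. This encoding and the function names are our own choice for illustration; the tools may store trees differently.

```python
def is_tree(nodes):
    """A tree is a non-empty, prefix-closed set of integer tuples."""
    return bool(nodes) and all(x[:i] in nodes for x in nodes for i in range(len(x)))

def children(nodes, x):
    """Children of x, ordered left to right by their last component."""
    return sorted(y for y in nodes if len(y) == len(x) + 1 and y[:len(x)] == x)

def leaves(nodes):
    """Nodes without children."""
    return sorted(x for x in nodes if not children(nodes, x))

def is_ancestor(x, y):
    """x is an ancestor of y iff x is a prefix of y."""
    return y[:len(x)] == x

def depth(x):
    return len(x)

def height(nodes):
    return max(depth(x) for x in nodes)

# Root () with children (0,) and (1,); node (0,) has children (0, 0) and (0, 1).
example = {(), (0,), (1,), (0, 0), (0, 1)}
```

A labeled tree is then simply a dictionary from these tuples to labels.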


3 The Alternating Cycle Decomposition 


The Alternating Cycle Decomposition (ACD), proposed by Casares et al. [6], is 
a generalization of the Zielonka tree. The ACD of an automaton A is a forest, a 
collection of trees, labeled with accepting and rejecting cycles of the automaton. 
For each SCC of A we have a unique tree and the labeling of each tree alternates 
between accepting and rejecting cycles. Thus the ACD captures the complexity 
of the cycle structure of each SCC. We present now the definition of the ACD 
adapted to TELA. 

For the rest of this section, let A = (Q, Σ, Q_0, Δ, Γ, α) be a TELA and let
G_A = (Q, E) be the associated graph with edges colored by γ: E → 2^Γ. We lift
γ to sets and define γ(E') = ⋃_{e ∈ E'} γ(e) for every subset E' ⊆ E.


104 A. Casares, A. Duret-Lutz, K.J. Meyer, F. Renkin, S. Sickert 


Definition 1. A cycle of A is a subset of edges ℓ ⊆ E forming a closed path in
G_A. A cycle ℓ is accepting (resp. rejecting) if γ(ℓ) ⊨ α (resp. γ(ℓ) ⊭ α). The set
of states of a cycle ℓ is States(ℓ) = {q ∈ Q : some e ∈ ℓ passes through q}. The
set of cycles of A is denoted Cycles(A). It is (partially) ordered by set inclusion.


Definition 2 ([6]). Let S_1, ..., S_k be an enumeration of the strongly connected
components of G_A. The Alternating Cycle Decomposition of A, denoted ACD(A),
is a collection of k Cycles(A)-labeled trees (𝒯_1, ..., 𝒯_k) with 𝒯_i = (T_i, η_i)
such that:

– η_i(ε) is the set of edges of S_i, for i = 1, ..., k.

– If x ∈ T_i and η_i(x) is an accepting cycle, then x has a child in T_i for each
  maximal element in {ℓ ∈ Cycles(A) : ℓ ⊆ η_i(x) and ℓ is rejecting}. In this
  case, we say that x is a round node.

– If x ∈ T_i and η_i(x) is a rejecting cycle, then x has a child in T_i for each
  maximal element in {ℓ ∈ Cycles(A) : ℓ ⊆ η_i(x) and ℓ is accepting}. In this
  case, we say that x is a square node.

If q ∈ Q is a state belonging to the SCC S_i in A, we define the tree associated
to q as the subtree 𝒯_q = (T_q, η_q) given by:

T_q = {ε} ∪ {x ∈ T_i : q ∈ States(η_i(x))},   η_q = η_i|_{T_q}.


Remark 1. We provide examples online at
https://spot.lrde.epita.fr/ipynb/zlktree.html and an executable copy of this
notebook is included in the artifact [8].


4 An Efficient Computation of the ACD 


In this section we give an algorithm to compute the Alternating Cycle
Decomposition of an Emerson-Lei automaton A, implemented in Owl [18] and Spot [9].
This can be done by first computing an SCC-decomposition of G_A, which gives us
the labels of the roots of the trees (𝒯_1, ..., 𝒯_k), and then recursively
computing the children of the nodes of each tree, following the definition of
ACD(A). Algorithm 1 shows how to compute the children of a given node and uses
notation we introduce now.

Let C ⊆ Γ be a subset of colors and let S = (Q_S, E_S) ⊆ G_A be a subgraph.
We define the projection of S on C, denoted S|_C = (Q_S, E_S^C), as the subgraph
of S obtained by removing the edges e ∈ E_S such that γ(e) ⊈ C, that is,
E_S^C = {(q, D, q') ∈ E_S : D ⊆ C}. We write Colors(S) = ⋃_{e ∈ E_S} γ(e). We say
that S' ⊆ S is a C-strongly connected component in S (C-SCC) if it is an SCC
of S|_C and Colors(S') = C. Further, max_⊆ is the set of all maximal elements
according to the partial order defined by ⊆.
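As a small illustration of the projection and of Colors, the sketch below represents a subgraph by its set of edges (q, D, q'), where D is a frozenset of colors. This representation and the function names are assumptions made for the example.

```python
def project(edges, colors):
    """Edges of S|_C: keep an edge (q, D, q') only when D is a subset of C."""
    allowed = frozenset(colors)
    return {(q, D, q2) for (q, D, q2) in edges if D <= allowed}

def colors_of(edges):
    """Colors(S): the union of the color sets carried by the edges."""
    result = set()
    for _, D, _ in edges:
        result |= D
    return result

# A two-state example; the self-loop on state 0 carries no color.
edges = {
    (0, frozenset({0}), 1),
    (1, frozenset({1}), 0),
    (1, frozenset({0, 2}), 1),
    (0, frozenset(), 0),
}
```

Projecting this example on {0, 1} removes only the self-loop on state 1, because it is the only edge carrying color 2; uncolored edges always survive a projection.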

Note that Algorithm 1 uses Algorithm 2, which simplifies the Emerson-Lei 
conditions before passing the formula to a Max-SAT function (a SAT-solver 
that computes maximal satisfying assignments, e.g., by clause blocking) [4]. This 
preprocessing ensures that the ACD for Rabin or Streett acceptance conditions 
can be constructed without making use of the general purpose algorithm for 
computing maximal satisfying assignments.

Algorithm 1 Computing the children of a node.

Input: A cycle S = η_i(x) corresponding to the label of a node x of ACD(A).
Output: The set of labels for the children of x, {S_1, ..., S_k}.
function Compute-Children(S)
    children ← ∅, C ← Colors(S)
    if C ⊨ α then        ▷ Maximal subsets D ⊆ C such that D ⊨ α ⇎ C ⊨ α
        {C_1, ..., C_k} ← Max-Satisfying-Subsets(C, ¬α)
    else
        {C_1, ..., C_k} ← Max-Satisfying-Subsets(C, α)
    for D ∈ {C_1, ..., C_k} do
        for S' ∈ SCCs of S|_D do        ▷ These might not be D-SCC in S
            if Colors(S') ⊨ α ⟺ D ⊨ α then
                children ← children ∪ {S'}
            else
                children ← children ∪ Compute-Children(S')
    return max_⊆ children        ▷ Remove from children non-maximal cycles


Algorithm 2 The subprocedure Max-Satisfying-Subsets.

Input: A subset of colors C ⊆ Γ and an EL condition α ∈ EL(Γ).
Output: max_⊆ {D ⊆ C : D ⊨ α}.
function Max-Satisfying-Subsets(C, α)
    if C ⊨ α then
        return {C}
    α ← α[if c ∈ C then c else ⊥]        ▷ Replace colors not in C by false
    L ← {c ∈ C : ¬c does not occur in α}
    if L ≠ ∅ then
        α ← α[if c ∈ L then ⊤ else c]        ▷ Replace colors in L by true
        {C_1, ..., C_k} ← Max-Satisfying-Subsets(C \ L, α)
        return {C_1 ∪ L, ..., C_k ∪ L}
    if α = ¬c_1 ∨ ⋯ ∨ ¬c_n then
        return {{c_1, ..., c_n} \ {c_i} : 1 ≤ i ≤ n}
    return Max-SAT(α)
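The output specification of Algorithm 2 can be checked against an exhaustive search on small inputs. The following sketch is such a reference implementation, not the algorithm itself: it enumerates all subsets, so it is exponential in |C| and only suitable for testing. The acceptance condition is passed as a Python predicate, which is an assumption of this example.

```python
from itertools import combinations

def max_satisfying_subsets_bruteforce(colors, accepts):
    """Maximal subsets D of `colors` with accepts(D) true, by exhaustive search."""
    colors = sorted(colors)
    found = []
    # Visit candidates from largest to smallest, so that any satisfying
    # superset of D has already been handled when D is examined.
    for size in range(len(colors), -1, -1):
        for combo in combinations(colors, size):
            D = frozenset(combo)
            if accepts(D) and not any(D < bigger for bigger in found):
                found.append(D)
    return set(found)

# The special case handled without Max-SAT: Fin(0) or Fin(1) or Fin(2).
def some_color_missing(D):
    return 0 not in D or 1 not in D or 2 not in D
```

On C = {0, 1, 2} and the condition `some_color_missing`, the search returns the three sets obtained by dropping one color each, which is what the shortcut for conditions of the form ¬c_1 ∨ ⋯ ∨ ¬c_n produces directly.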


Memoization. To optimize the construction of the ACD and to avoid duplicated 
recursive calls, we perform two kinds of memoization: First, we memoize the 
results of calling Algorithm 2 from Algorithm 1. (Thus we implicitly construct
a Zielonka DAG for α.) Second, we memoize the recursive calls to Algorithm 1:
this is useful, as distinct nodes in the ACD can be labeled by the same cycles. 
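One simple way to realize such a cache is to key it on hashable versions of the arguments. The wrapper below is a generic sketch and makes no claim about how Owl or Spot store their caches; `make_memoized`, the miss counter, and the requirement that the condition be hashable are assumptions.

```python
from functools import lru_cache

def make_memoized(compute):
    """Return a cached version of `compute(colors, cond)` and a miss counter."""
    stats = {"misses": 0}

    @lru_cache(maxsize=None)
    def cached(colors, cond):
        stats["misses"] += 1  # executed only when the result is not cached yet
        return compute(colors, cond)

    def query(colors, cond):
        # Normalize the key: color collections become frozensets, so that
        # equal sets hit the same cache entry regardless of how they were built.
        return cached(frozenset(colors), cond)

    return query, stats
```

Querying the same color set twice, for example once as `{0, 1, 2}` and once as `[2, 1, 0]`, triggers a single call of the wrapped function. The same wrapper could be applied to the recursive computation of children, keyed on the frozen set of edges of the cycle.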


5 From Emerson-Lei to Parity Automata 


In this section we describe the transformation from TELA to parity automata
using the Alternating Cycle Decomposition [6]. This transformation provides
strong optimality guarantees: the resulting parity automaton has minimal size
among those that can be produced without merging states from the TELA and
it uses an optimal number of colors (Theorem 1). We also show that this
transformation can be adapted to produce state-based automata. Note that in this
case we lose the first optimality guarantee.


5.1 The ACD Transformation 


Let A = (Q, Σ, Q_0, Δ, Γ, α) be a TELA and let ACD(A) = (𝒯_1, ..., 𝒯_k). We
introduce the following notation that will allow us to move in the ACD.

Given a transition e = q --a:C--> q' such that both q and q' belong to the i-th
SCC of A and a node x ∈ T_i, we define Support(x, e) to be the least (i.e.,
deepest) ancestor z of x in T_i such that e ∈ η_i(z). If Support(x, e) ≠ x and it
is not a leaf in 𝒯_{q'}, let z' be the only child of Support(x, e) that is an
ancestor of x, and let y_1, ..., y_s be an enumeration from left to right of the
nodes in Children_{T_{q'}}(Support(x, e)). We define NextBranch(x, e) as:

    Support(x, e),   if Support(x, e) = x or if Support(x, e) is a leaf in 𝒯_{q'},
    y_1,             if z' = y_s,
    y_{j+1},         if z' = y_j, 1 ≤ j < s.

We define a parity automaton P_ACD(A) = (P, Σ, P_0, Δ_P, Γ_P, β) (the ACD
transform of A) equivalent to A as follows:

States. The states of P_ACD(A) are of the form (q, x), for q ∈ Q and x a leaf of
the tree associated to q. Initial states are of the form (q_0, x), where q_0 ∈ Q_0
is an initial state in A and x is the leftmost leaf of its corresponding tree.

P = ⋃_{q ∈ Q} {q} × Leaves(𝒯_q),   P_0 = {(q_0, x) : q_0 ∈ Q_0, x the leftmost leaf in 𝒯_{q_0}}.
qEQ 


Transitions. For each transition e = q --a:C--> q' in A and each state (q, x) ∈ P,
let us define a transition (q, x) --a:p--> (q', y) in Δ_P as follows: first, q' is
the destination state of the original transition. If q and q' are not in the same
SCC, then y is defined as the leftmost leaf in 𝒯_{q'} and p = 1 (except if all 𝒯_i
have height 1 and a round root: in that case p = 0). Otherwise, if both q
and q' belong to the i-th SCC of A, then the destination leaf y is the leftmost
descendant of NextBranch(x, e) in 𝒯_{q'}.

We define the color p of the transition as Depth(Support(x, e)), if the root
of 𝒯_i is a round node (η_i(ε) ⊨ α), or as Depth(Support(x, e)) + 1 otherwise.
We remark that in this way, p is even if and only if η_i(z) ⊨ α, where
z = Support(x, e).


Parity condition. The condition β is a parity min even condition (cf. Table 1).


Remark 2. If the color 0 does not appear on any transition then we shift all
colors by −1 and replace β by a parity min odd condition.


Proposition 1 ([6]). The automaton P_ACD(A) recognizes L(A).

Remark 3. The ACD transformation preserves many properties (determinism,
completeness, good-for-gameness, unambiguity...) of the automaton A, see [6].


Remark 4. Since the number of colors used by P_ACD(A) is at most the height of
a tree in ACD(A), we obtain that P_ACD(A) never uses more colors than |Γ| + 1.
Furthermore, since the TELA does not require all transitions to have a color, we
can omit the maximal one and produce an automaton with at most |Γ| colors.


In order to state the optimality of this transformation we introduce the
notion of locally bijective morphisms of automata. Given an automaton A =
(Q, Σ, Q_0, Δ, Γ, α) and q ∈ Q, we denote by Out_A(q) the set of outgoing
transitions of q, i.e., Out_A(q) = {q --a:C--> q' ∈ Δ : a ∈ Σ, C ⊆ Γ, q' ∈ Q}.

Definition 3. Let A = (Q, Σ, Q_0, Δ, Γ, α) and A' = (Q', Σ, Q'_0, Δ', Γ', α')
be two EL automata over Σ. A locally bijective morphism from A to A' (denoted
φ: A → A') is a pair of maps φ_Q: Q → Q', φ_Δ: Δ → Δ' such that:

– φ_Q|_{Q_0} is a bijection between Q_0 and Q'_0.
– φ_Δ(q_1 --a:C--> q_2) = φ_Q(q_1) --a:C'--> φ_Q(q_2) for some C' ⊆ Γ'.
– For every q ∈ Q, φ_Δ|_{Out_A(q)} is a bijection between Out_A(q) and Out_{A'}(φ_Q(q)).
– For every run ρ ∈ Δ^ω in A, ρ is accepting iff φ_Δ(ρ) is accepting in A'.


Theorem 1 ([6]). Let A be an Emerson-Lei automaton, and let P_ACD(A) be
the parity automaton obtained by applying the ACD transformation. Then,

– There is a locally bijective morphism φ: P_ACD(A) → A.

– If P' is a parity automaton admitting a locally bijective morphism to A, then
  |P_ACD(A)| ≤ |P'|.

– If P' is a parity automaton recognizing L(A), P' uses at least as many colors
  as P_ACD(A).


Note that all state-duplicating constructions mentioned in the introduction 
create locally bijective morphisms. Thus the above theorem shows that the ACD 
transformation duplicates the least number of states. 


5.2 Experimental Results 


Figures 1 and 2 compare four different paritization procedures applied to 1065 
TELA generated? from LTL formulas from the Synthesis Competition. These 
automata have between 2 and 55 colors (mean 5.92, median 5) and between 
1 and 245761 states (mean 2023.20, median 20). Automata with fewer than 2 
colors have been ignored since they are trivial to paritize. 

The procedures are Owl’s and Spot's implementation of ACD transform, as 
well as Spot’s implementation of the Zielonka Tree transform [6], and Spot’s 
previous paritization function (called to parity) [28]. We refer the reader to 
Section 8 for information about the used versions. T'wo dotted lines on the sides 


5 We used 1tl2tgba -G -D from Spot, and 1t12dela from Owl. 
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Fig. 1: Comparison of the output size of the four paritization procedures. 
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Fig.2: Time spent performing these four paritization procedures. 


of the plots hold cases that did not finish within 500 seconds (red, inner line), 
or where the tool reported an error? (orange, outer line). Pink dots represent 
input automata that already have parity acceptance: for those, running the ACD 
transform still makes sense as it will produce an output with a minimal number 
of colors. However, Owl’s implementation, which mostly cares about reducing the 
number of states, uses a shortcut and will return the input automaton unmodified 
in this case: this explains the pink cloud on the left of Figure 2. 

Owl’s and Spot’s implementations of the ACD transform produce automata
of the same size, as expected. The cases that are not on the diagonal all
correspond to timeouts or tool errors. The Zielonka Tree transform, which does
not take the automaton structure into consideration, produces automata that
are on average 2.11 times bigger (median 1.60), while it is on
average 6.55 times slower (median 0.97). Lastly, Spot’s to_parity function is
not far from the optimal size given by the ACD transform: on average its output
is 3.28 times larger, but the median of that size ratio is 1.00. Similarly, it is on
average 15.94 times slower, but with a median of 1.04.


⁶ Either “out-of-memory”, or “too many colors”, as Spot is restricted to 32 colors.

5.3 ACD Transformation Towards State-Based Parity Automata


Sometimes it is desired to obtain an automaton with the acceptance defined over
states. A state-based parity automaton is a tuple $\mathcal{A} = (Q, \Sigma, Q_0, \Delta, \phi : Q \to \mathbb{N})$
where $(Q, \Sigma, Q_0, \Delta)$ is the underlying structure defined as for transition-based
automata in Section 2 (with the only difference that $\Delta \subseteq Q \times \Sigma \times Q$ now), and
$\phi : Q \to \mathbb{N}$ is a map associating colors to states. A run over $\mathcal{A}$ is accepting if the
minimal color visited infinitely often is even.

Let $\mathcal{A}$ be a TELA with $ACD(\mathcal{A}) = (\mathcal{T}_1, \dots, \mathcal{T}_k)$. We define an equivalent
state-based parity automaton $\mathcal{P}_{sb\text{-}ACD(\mathcal{A})} = (P, \Sigma, P_0, \Delta_P, \phi_P : P \to \mathbb{N})$ as follows:


States. States are of the form $(q, x)$, for $q \in Q$ and $x \in \mathcal{T}_q$ (now the second
component corresponds to a node of the ACD that is not necessarily a leaf).
The set of initial states is the same as for $\mathcal{P}_{ACD(\mathcal{A})}$:

$$P = \bigcup_{q \in Q} \{q\} \times \mathcal{T}_q, \qquad P_0 = \{(q_0, x) : q_0 \in Q_0,\ x \text{ the leftmost leaf in } \mathcal{T}_{q_0}\}.$$


Transitions. For each transition $e = q \xrightarrow{a} q' \in \Delta$ and $(q, x) \in P$ we define
one transition $(q, x) \xrightarrow{a} (q', y) \in \Delta_P$. To specify the destination node $y$, we
distinguish two cases:

Suppose that $x$ is a leaf in $\mathcal{T}_q$. If $NextBranch(x, e)$ is not the leftmost child
of $Support(x, e)$ in $\mathcal{T}_q$, then $y$ is the leftmost leaf below $NextBranch(x, e)$ in
$\mathcal{T}_{q'}$ (as in the transition-based case). If $NextBranch(x, e)$ is the leftmost child
(a “lap” around $Support(x, e)$ is finished), then we set $y = Support(x, e)$.

If $x$ is not a leaf in $\mathcal{T}_q$, the destination $y$ is determined exactly as if the
transition started in $(q, x')$ for $x'$ the leftmost leaf in $\mathcal{T}_q$ under $x$.

Parity condition. $\phi_P((q, x)) = Depth(x)$, if the root of $\mathcal{T}_q$ is a round node, and
$\phi_P((q, x)) = Depth(x) + 1$ otherwise.


Note that we do not have the same optimality guarantee as in the transition-based
case: if $x$ is not a leaf in its corresponding tree, then the states of the form
$(q, x) \in P$ are not necessarily reachable in $\mathcal{P}_{sb\text{-}ACD(\mathcal{A})}$. We only need to add
those that can be reached from the initial state. However, the set of reachable
states does depend on the ordering of the children in the trees of the ACD, and
therefore the size of the final automaton depends on this ordering.

We propose a heuristic to order the children of nodes in $ACD(\mathcal{A})$. Let $\mathcal{T}_i$ be
a tree in $ACD(\mathcal{A})$ and $x \in \mathcal{T}_i$. We define:

$$D_i(x) = \{q' \in Q : q \xrightarrow{a} q' \notin \nu_i(x), \text{ for some } q \in States(\nu_i(x)),\ a \in \Sigma\}.$$

The heuristic consists in ordering the children of a node of $\mathcal{T}_i$ by decreasing $|D_i(x)|$.


Experiments involving transformations towards state-based automata and test- 
ing this heuristic can be found in Section 6.2. 


110 A. Casares, A. Duret-Lutz, K.J. Meyer, F. Renkin, S. Sickert 
6 Degeneralization of Generalized Büchi Automata 


The transformation of generalized-Büchi automata with n colors into Büchi
automata (with a single color) is known as “degeneralization” and has been a
very common processing step between algorithms that translate temporal-logic
formulas into generalized-Büchi automata, and model-checking algorithms that
(used to) only work with Büchi automata. While it initially consisted in making
$2^n$ copies of the GBA [30, Appendix B] to remember the set of colors that had
yet to be seen, degeneralization to state-based Büchi acceptance can be done
using only n + 1 copies once an arbitrary order of colors has been selected [13]. A
similar construction to transition-based Büchi acceptance requires only n copies
of the original automaton. Different orders of colors may lead to different
numbers of reachable states in the Büchi automaton. Some tools even attempted to
start the degeneralization in different copies to reduce the number of reachable
states [14]. Nowadays, an implementation such as the degeneralization of Spot
implements several SCC-based optimizations [2] to reduce the number of output
states, but is still sensitive to the arbitrary order selected for colors.
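
The n-copy transition-based construction and its sensitivity to the color order can be illustrated with a small sketch. This is not Spot's implementation (it has none of the SCC-based optimizations); the function name `degeneralize_tba`, the dictionary encoding of the automaton, and the two-state example are assumptions made for illustration only.

```python
from collections import deque

def degeneralize_tba(transitions, initial, order):
    """Textbook degeneralization to transition-based Buchi acceptance.

    transitions: dict state -> list of (letter, set_of_colors, destination)
    order: the n colors, listed in the order in which they are awaited
    Output states are pairs (q, level), where level is the index in
    `order` of the next color that still has to be seen.
    """
    n = len(order)
    start = (initial, 0)
    seen = {start}
    todo = deque([start])
    edges = []
    while todo:
        q, level = todo.popleft()
        for letter, colors, dst in transitions.get(q, []):
            j = level
            # Skip every awaited color carried by this transition.
            while j < n and order[j] in colors:
                j += 1
            accepting = (j == n)  # every color was seen since the last reset
            if accepting:
                j = 0
            target = (dst, j)
            edges.append(((q, level), letter, accepting, target))
            if target not in seen:
                seen.add(target)
                todo.append(target)
    return edges, seen

# Hypothetical two-state generalized Buchi automaton with colors 0 and 1.
gba = {"x": [("a", {0}, "y")], "y": [("b", {1}, "x")]}
_, states_01 = degeneralize_tba(gba, "x", [0, 1])
_, states_10 = degeneralize_tba(gba, "x", [1, 0])
print(len(states_01), len(states_10))
```

On this example the order (0, 1) reaches two states while the order (1, 0) reaches three, although both outputs recognize the same language.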


6.1 Transition-based Degeneralization 


This order-sensitivity of the degeneralization, even in its transition-based
variant, makes a striking difference with ACD. When applied to a generalized Büchi
automaton that has some accepting and rejecting paths, the ACD-transform
produces an automaton with acceptance Inf(⓪) ∨ Fin(❶). Since all transitions are
either labeled by ⓪ or ❶, color ❶ is superfluous⁷ and the condition can be reduced
to Inf(⓪). In this context, ACD-transform therefore gives us a transition-based
Büchi automaton by duplicating the fewest number of states (Theorem 1(2)).

It can be seen that the cycling around the different children of the ACD
(whose ordering is arbitrary) performed during ACD-transform is similar to the
process used in traditional degeneralization. What makes the latter sensitive to
color ordering is that it only “sees” one transition at a time, while the ACD
provides a view of the cycles. For instance a degeneralization would process
the sequence $x \xrightarrow{⓪} y \xrightarrow{❶} z$ differently from the sequence $x \xrightarrow{❶} y \xrightarrow{⓪} z$
depending on the order in which colors are expected to be encountered. However,
if there is no other transition reaching or leaving $y$, the two colors will always be
seen together so their order should not matter: the two transitions belong to the
same node of the ACD. The propagation of colors [28] is a related preprocessing
step that can improve the degeneralization by propagating all colors common to
the incoming transitions of a state to its outgoing transitions and vice-versa. It
would turn the previous situation into $x \xrightarrow{⓪❶} y \xrightarrow{⓪❶} z$, making the color
order selected by the degeneralization irrelevant (in this case).
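
The propagation step described above can be sketched as a fixpoint over the transitions. This is an illustrative reading of the idea, not the implementation from [28]; the name `propagate_colors` and the triple encoding of transitions are assumptions.

```python
def propagate_colors(edges):
    """Propagate colors shared by all incoming (resp. outgoing) transitions
    of a state to its outgoing (resp. incoming) transitions.

    edges: iterable of (source, set_of_colors, destination)
    Returns a list of (source, frozenset_of_colors, destination).
    """
    edges = [(s, set(c), d) for s, c, d in edges]
    states = {s for s, _, _ in edges} | {d for _, _, d in edges}
    changed = True
    while changed:  # colors only grow, so this terminates
        changed = False
        for q in states:
            incoming = [c for _, c, d in edges if d == q]
            outgoing = [c for s, c, _ in edges if s == q]
            if not incoming or not outgoing:
                continue
            common_in = set.intersection(*incoming)
            common_out = set.intersection(*outgoing)
            for c in outgoing:
                if not common_in <= c:
                    c |= common_in
                    changed = True
            for c in incoming:
                if not common_out <= c:
                    c |= common_out
                    changed = True
    return [(s, frozenset(c), d) for s, c, d in edges]

# The situation from the text: x --{0}--> y --{1}--> z, nothing else touches y.
result = propagate_colors([("x", {0}, "y"), ("y", {1}, "z")])
print(result)
```

Both transitions end up carrying colors 0 and 1. The step is sound for conditions on colors seen infinitely often, because a run that leaves a state infinitely often also enters it infinitely often.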

A comparison of the output size of the traditional degeneralization imple- 
mented in Spot (which includes several optimizations learned over the years) 


⁷ In an automaton with “parity min” acceptance where all transitions are colored, the
maximal color can always be omitted and replaced by the empty set. 
[Figure 3 plots omitted. Left panel, x-axis “TBA.degen (states)”: 0 cases above the diagonal, 581 on the diagonal, 419 below. Right panel, x-axis “TBA.degen.propagate (states)”: 0 cases above the diagonal, 765 on the diagonal, 235 below.]


Fig. 3: Two-dimensional histogram of the sizes of 1000 automata, degeneralized 
to transition-based Büchi automata, using Spot's degeneralization function (with 
or without propagation of colors), or using ACD-transform. 


against that of ACD-transform is given in the left plot of Figure 3. Unsurpris- 
ingly, because of ACD-transform's optimality, there are no cases where ACD 
loses to Spot's transition-based degeneralization. The use of the propagation of 
colors (right of the plot) is an improvement (the non-optimal cases dropped from 
419 to 235) but not a cure. 


Remark 5. The input automata used in this section and the next one are a set of
1000 randomly generated, minimal, deterministic, transition-based generalized
Büchi automata, with 3 or 4 states and 2 or 3 colors. The reason for using such
small minimal automata is to be able to use a SAT-based minimization [1] on
the degeneralized state-based output in the next section to estimate how large
the gap between an optimal procedure and ours is.


6.2 State-based degeneralization 


If ACD is used to produce a state-based output, as explained in Subsection 5.3, 
the obtained automaton is not guaranteed to be minimal with respect to locally 
bijective morphisms. In this case we can obtain a weaker optimality result: 


Proposition 2. Let $\mathcal{A}$ be a generalized Büchi automaton, and let $\mathcal{B}_{sb\text{-}ACD(\mathcal{A})}$
be the state-based Büchi automaton obtained by applying the ACD state-based
transformation. If $\mathcal{B}'$ is a state-based Büchi automaton admitting a locally
bijective morphism to $\mathcal{A}$, then $|\mathcal{B}_{sb\text{-}ACD(\mathcal{A})}| \le |\mathcal{B}'| + |\mathcal{A}|$.


Proof. Let $\mathcal{B}'$ be a state-based Büchi automaton admitting a locally bijective
morphism to $\mathcal{A}$. We can transform it into a transition-based Büchi automaton
$\mathcal{B}'_{trans}$ by setting the transitions leaving accepting states to be accepting. This
automaton has the same size as $\mathcal{B}'$ and it also admits a locally bijective
morphism to $\mathcal{A}$. Therefore, by Theorem 1, we have that $|\mathcal{B}_{ACD(\mathcal{A})}| \le |\mathcal{B}'_{trans}| =$




[Figure 4 plots omitted. Left panel, x-axis “SBA.acd (states)”: 402 cases above the diagonal. Right panel, x-axis “SBA.acd.heuristic (states)”: 498 cases above the diagonal.]


Fig.4: Comparison of three ways to degeneralize to state-based Büchi: (acd, 
acd.heuristic) using the state-based version of ACD-transform with or without 
heuristic, and (degen) classical degeneralization. 


[Figure 5 plots omitted. Left panel, x-axis “SBA.acd (states)”: 756 cases on the diagonal, 241 below. Right panel, x-axis “SBA.minimal (states)”: 94 cases above the diagonal, 555 on the diagonal, 0 below.]


Fig. 5: Effect of the heuristic for ordering children of the ACD, and comparison 
to the minimal degeneralized automata (when known). 


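$|\mathcal{B}'|$, where $\mathcal{B}_{ACD(\mathcal{A})}$ is the transition-based automaton obtained by applying the
ACD-transformation. We claim that $|\mathcal{B}_{sb\text{-}ACD(\mathcal{A})}| \le |\mathcal{B}_{ACD(\mathcal{A})}| + |\mathcal{A}|$ (therefore
implying that $|\mathcal{B}_{sb\text{-}ACD(\mathcal{A})}| \le |\mathcal{B}'| + |\mathcal{A}|$). Indeed, the set of states of $\mathcal{B}_{sb\text{-}ACD(\mathcal{A})}$
is the union of the set of states of $\mathcal{B}_{ACD(\mathcal{A})}$ and a subset of nodes of the form
$(q, \varepsilon)$, where $\varepsilon$ is the root of $\mathcal{T}_q$. There are at most $|\mathcal{A}|$ nodes of this form.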


Figure 4 compares three ways to perform state-based degeneralization. The 
ACD comes in two variants, with or without the heuristic of Section 5.3, and it 
is compared against the state-based degeneralization of Spot. 

Figure 5 shows how the heuristic variant compares to the one without, and
how it compares with the size of a minimal DBA, when its size could be computed
in reasonable time (in 649 cases). Note that there might not be a locally bijective
morphism between the input automaton and the minimal DBA computed this
way; nonetheless, these minimal-size automata can serve as a reference point to
estimate the quality of a degeneralization. Compared to this subset of minimal
DBA, the average number of additional states produced by the state-based ACD
is 0.17 with heuristics, and 0.33 without. Comparatively, Spot's degeneralization
has an average of 1.21 extra states.


7 Deciding Typeness 


We highlight now how the ACD can be used to decide typeness of deterministic 
TELA. This problem, first introduced by Krishnan and Brayton [19], consists of 
deciding whether we can replace the acceptance condition of a given automaton 
by another (hopefully simpler) without changing the transition structure and 
preserving the language (see Table 1 for a list of common acceptance conditions). 

Let $\mathcal{A} = (Q, \Sigma, Q_0, \Delta, \Gamma, \phi)$ be a TELA. We say that $\mathcal{A}$ is X-type, for
$X \in \{B, C, GB, GC, P, R, S\}$, if there is an X-automaton over the same structure,
$\mathcal{A}' = (Q, \Sigma, Q_0, \Delta, \Gamma', \phi')$ (where $\mathcal{A}$ and $\mathcal{A}'$ only differ on the coloring of the
transitions), such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}(\mathcal{A}')$ and $\phi'$ belongs to X. We emphasize that
we permit the use of a different set of colors $\Gamma'$ in $\mathcal{A}'$. Some conditions can always
be rewritten as conditions of other kinds (for example, Büchi conditions can be
expressed as parity ones, so being B-type implies being P-type). We should not
confuse this notion with the expressive power of deterministic automata using
these conditions. For example, both deterministic parity automata and Rabin
automata recognize all ω-regular languages, but there are Rabin automata that
are not parity-type. Further, we say that an automaton $\mathcal{A}$ is weak if for every
SCC S of $\mathcal{A}$, all cycles in S are accepting or all of them are rejecting.

The following result shows that the ACD is a sufficient data structure for 
deciding typeness for many common acceptance conditions. We remark that the 
second item adds to the results of Casares et al. [7] (this statement only holds if 
transitions of automata are labeled with subsets of colors, which is not allowed 
in their model). 


Proposition 3 ([7, Section 5.2]). Let $\mathcal{A}$ be a deterministic TELA such that
all its states $q \in Q$ are reachable and let $ACD(\mathcal{A}) = (\mathcal{T}_1, \dots, \mathcal{T}_k)$ be its
Alternating Cycle Decomposition. Then the following statements hold:


1. $\mathcal{A}$ is Rabin-type (resp. Streett-type) if and only if for every $q \in Q$, every round
node (resp. square node) of $\mathcal{T}_q$ has at most one child in $\mathcal{T}_q$. It is parity-type
if and only if it is both Rabin- and Streett-type.

2. $\mathcal{A}$ is generalized Büchi-type (resp. generalized co-Büchi-type) if and only if
for every $1 \le i \le k$, $Height(\mathcal{T}_i) \le 2$ and in case of equality, the root of $\mathcal{T}_i$ is
a round node (resp. square node).

3. $\mathcal{A}$ is weak if and only if for every $1 \le i \le k$, $Height(\mathcal{T}_i) = 1$.


Also, the least number of colors used by a deterministic parity automaton
recognizing $\mathcal{L}(\mathcal{A})$ is $\max_{1 \le i \le k} Height(\mathcal{T}_i) + \nu$, where $\nu = 0$ if the roots of all trees of
maximal height have the same shape (round or square), and $\nu = 1$ otherwise.

If one of the previous conditions holds, then $ACD(\mathcal{A})$ also provides an effective
procedure to relabel $\mathcal{A}$ with the corresponding acceptance condition.
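
The shape checks of Proposition 3 are simple tree traversals once the ACD is available. The sketch below is not the data structure used by Spot or Owl. The `Node` class and the function names are hypothetical, and it assumes that $\mathcal{T}_q$ consists of the nodes whose cycle visits state q and that a tree reduced to its root has height 1.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    is_round: bool      # True for a round node, False for a square node
    states: frozenset   # states visited by the cycle labeling this node
    children: list = field(default_factory=list)

def height(tree):
    return 1 + max((height(c) for c in tree.children), default=0)

def nodes(tree):
    yield tree
    for child in tree.children:
        yield from nodes(child)

def is_weak(forest):
    # Item 3: every tree is reduced to its root.
    return all(height(t) == 1 for t in forest)

def is_generalized_buchi_type(forest):
    # Item 2: height at most 2, and a round root when the height is 2.
    return all(height(t) < 2 or (height(t) == 2 and t.is_round) for t in forest)

def has_shape(forest, all_states, round_nodes):
    # Item 1: Rabin-type with round_nodes=True, Streett-type with False.
    for q in all_states:
        for tree in forest:
            for x in nodes(tree):
                if x.is_round == round_nodes and q in x.states:
                    if sum(q in c.states for c in x.children) > 1:
                        return False
    return True

def min_parity_colors(forest):
    h = max(height(t) for t in forest)
    root_shapes = {t.is_round for t in forest if height(t) == h}
    return h + (0 if len(root_shapes) == 1 else 1)

# Hypothetical ACD: one round root with two square children.
forest = [Node(True, frozenset({"p", "q"}),
               [Node(False, frozenset({"p"})), Node(False, frozenset({"q"}))])]
print(is_weak(forest), is_generalized_buchi_type(forest), min_parity_colors(forest))
```

Parity-typeness is then the conjunction of the two `has_shape` calls, as in item 1.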


Remark 6. The ACD gives a typeness result for each SCC of the automaton,
which allows simplifying the acceptance condition of each of them independently.
Further, implications from right to left in Proposition 3 also hold for
non-deterministic automata.


Proposition 3 provides an effective procedure to check typeness of TELA: 
we just have to build the ACD and verify that it has the appropriate shape. 
Spot’s implementation of ACD has options to abort the construction as soon 
as it detects that the shape is wrong. Moreover, if an automaton is parity-type, 
the ACD provides a method to relabel the automaton with a minimal number 
of colors. Finally, if the automaton already has parity acceptance, the ACD 
transformation boils down to the algorithm of Carton and Maceiras [5]. 


8 Availability 


The ACD and the transformations based on it are currently implemented in two 
open-source tools: Spot 2.10 [9] and Owl 21.0 [18]. (The original developments 
were independent before the authors met and worked on this joint paper.) 

In Spot 2.10, the ACD can be explored using the Python bindings. The acd
class implements the decomposition, and will render it as an interactive forest of
nodes that can be clicked to highlight the relevant cycles in the input automaton.
The acd_transform() and acd_transform_sbacc() functions implement the
transition-based and state-based variants of the paritization procedure. Additionally, the
acd class has options to heuristically order the children to favor the state-based
construction, or to abort the construction as soon as it is clear that the ACD
does not have Rabin or Streett shape (in case one wants to use it to establish
typeness of automata). All these features are illustrated at
https://spot.lrde.epita.fr/ipynb/zlktree.html. In the future, ACD will be used more by the rest of
Spot, and will be one option of the ltlsynt tool (for LTL synthesis).

In Owl, the ACD transformation is available through the aut2parity com- 
mand. This command reads an automaton in the HOA format [3] using arbi- 
trary acceptance, and produces a parity automaton in the same format. The tool 
Strix [23], which builds upon Owl, gained in version 21.0.0 the option to use the 
ACD-construction as an intermediate step. 

Instructions to reproduce all experiments are included in the artifact [8].


9 Conclusion 


We have shown that ACD is more than a theoretically-appealing construction: 
our two implementations show that the construction is very usable in practice, 
and provide a baseline for further improvements. We have also shown that ACD is 
a Swiss-army knife for w-automata in the sense that it can generalize and replace 
several specific constructions (paritization, degeneralization, typeness checks). 
Abstract. We propose several heuristics for mitigating one of the main causes
of combinatorial explosion in rank-based complementation of Büchi automata
(BAs): unnecessarily high bounds on the ranks of states. First, we identify elevator
automata, which is a large class of BAs (generalizing semi-deterministic BAs),
occurring often in practice, where ranks of states are bounded according to the
structure of strongly connected components. The bounds for elevator automata
also carry over to general BAs that contain elevator automata as a sub-structure.
Second, we introduce two techniques for refining bounds on the ranks of BA states
using data-flow analysis of the automaton. We implement our techniques as an
extension of the tool RANKER for BA complementation and show that they indeed
greatly prune the generated state space, obtaining significantly better results and
outperforming other state-of-the-art tools on a large set of benchmarks.


1 Introduction 


Büchi automata (BA) complementation has been a fundamental problem underlying
many applications since it was introduced in 1962 by Büchi [8,17] as an essential part of
a decision procedure for a fragment of the second-order arithmetic. BA complementation
has been used as a crucial part of, e.g., termination analysis of programs [13,20,10] or
decision procedures for various logics, such as S1S [8], the first-order logic of Sturmian
words [33], or the temporal logics ETL and QPTL [38]. Moreover, BA complementation
also underlies BA inclusion and equivalence testing, which are essential instruments in
the BA toolbox. Optimal algorithms, whose output asymptotically matches the lower
bound of $(0.76n)^n$ [43] (potentially modulo a polynomial factor), have been
developed [37,1]. For a successful real-world use, asymptotic optimality is, however, not
enough and these algorithms need to be equipped with a range of optimizations to make
them behave better than the worst case on BAs occurring in practice.

In this paper, we focus on the so-called rank-based approach to complementation,
introduced by Kupferman and Vardi [24], further improved with the help of Friedgut [14],
and finally made optimal by Schewe [37]. The construction stores in a macrostate partial
information about all runs of a BA $\mathcal{A}$ over some word $\alpha$. In addition to tracking states
that $\mathcal{A}$ can be in (which is sufficient, e.g., in the determinization of NFAs), a macrostate
also stores a guess of the rank of each of the tracked states in the run DAG that captures
all these runs. The guessed ranks impose restrictions on what the future of a state might
look like (i.e., when $\mathcal{A}$ may accept). The number of macrostates in the complement
depends combinatorially on the maximum rank that occurs in the macrostates. The
constructions in [24,14,37] provide only coarse bounds on the maximum ranks.

A way of decreasing the maximum rank has been suggested in [15] using a PSPACE
(and, therefore, not really practically applicable) algorithm (the problem of finding the
optimal rank is PSPACE-complete). In our previous paper [19], we have identified several
basic optimizations of the construction that can be used to refine the tight-rank upper 
bound (TRUB) on the maximum ranks of states. In this paper, we push the applicability 
of rank-based techniques much further by introducing two novel lightweight techniques 
for refining the TRUB, thus significantly reducing the generated state space. 

Firstly, we introduce a new class of the so-called elevator automata, which occur
quite often in practice (e.g., as outputs of natural algorithms for translating LTL to
BAs). Intuitively, an elevator automaton is a BA whose strongly connected components
(SCCs) are all either inherently weak¹ or deterministic. Clearly, the class substantially
generalizes the popular inherently weak [6] and semi-deterministic BAs [11,3,4]. The
structure of elevator automata allows us to provide tighter estimates of the TRUBs,
not only for elevator automata per se, but also for BAs where elevator automata occur
as a sub-structure (which is even more common). Secondly, we propose a lightweight
technique, inspired by data flow analysis, that propagates rank restrictions along
the skeleton of the complemented automaton, obtaining even tighter TRUBs. We also
extended the optimal rank-based algorithm to transition-based BAs (TBAs).

We implemented our optimizations within the RANKER tool [18] and evaluated our
approach on thousands of hard automata from the literature (15 % of them were elevator
automata that were not semi-deterministic, and many more contained an elevator
sub-structure). Our techniques drastically reduce the generated state space; in many cases we
even achieved exponential improvement compared to the optimal procedure of Schewe 
and our previous heuristics. The new version of RANKER gives a smaller complement in 
the majority of cases of hard automata than other state-of-the-art tools. 


2 Preliminaries 


Words, functions. We fix a finite nonempty alphabet $\Sigma$ and the first infinite ordinal
$\omega = \{0, 1, \dots\}$. For $n \in \omega$, by $[n]$ we denote the set $\{0, \dots, n\}$. For $i \in \omega$ we use
$\lfloor\!\lfloor i \rfloor\!\rfloor$ to denote the largest even number smaller than or equal to $i$, e.g., $\lfloor\!\lfloor 42 \rfloor\!\rfloor = \lfloor\!\lfloor 43 \rfloor\!\rfloor = 42$.
An (infinite) word $\alpha$ is represented as a function $\alpha : \omega \to \Sigma$ where the $i$-th symbol is
denoted as $\alpha_i$. We abuse notation and sometimes also represent $\alpha$ as an infinite sequence
$\alpha = \alpha_0 \alpha_1 \dots$ We use $\Sigma^\omega$ to denote the set of all infinite words over $\Sigma$. For a (partial)
function $f : X \to Y$ and a set $S \subseteq X$, we define $f(S) = \{f(x) \mid x \in S\}$. Moreover, for
$x \in X$ and $y \in Y$, we use $f \triangleleft \{x \mapsto y\}$ to denote the function $(f \setminus \{x \mapsto f(x)\}) \cup \{x \mapsto y\}$.


Büchi automata. A (nondeterministic transition/state-based) Büchi automaton (BA)
over $\Sigma$ is a quadruple $\mathcal{A} = (Q, \delta, I, Q_F \cup \delta_F)$ where $Q$ is a finite set of states,
$\delta : Q \times \Sigma \to 2^Q$ is a transition function, $I \subseteq Q$ is the set of initial states, and
$Q_F \subseteq Q$ and $\delta_F \subseteq \delta$ are the sets of accepting states and accepting transitions
respectively. We sometimes treat $\delta$ as a set of transitions $p \xrightarrow{a} q$, for instance, we use
$p \xrightarrow{a} q \in \delta$ to denote that $q \in \delta(p, a)$.
Moreover, we extend $\delta$ to sets of states $P \subseteq Q$ as $\delta(P, a) = \bigcup_{p \in P} \delta(p, a)$, and to sets
of symbols $\Gamma \subseteq \Sigma$ as $\delta(P, \Gamma) = \bigcup_{a \in \Gamma} \delta(P, a)$. We define the inverse transition function
as $\delta^{-1} = \{p \xrightarrow{a} q \mid q \xrightarrow{a} p \in \delta\}$. The notation $\delta_{|S}$ for $S \subseteq Q$ is used to denote the
restriction of the transition function $\delta \cap (S \times \Sigma \times S)$. Moreover, for $q \in Q$, we use $\mathcal{A}[q]$
to denote the BA $(Q, \delta, \{q\}, Q_F \cup \delta_F)$.

¹ An SCC is inherently weak if it either contains no accepting states or, on the other hand, all
cycles of the SCC contain an accepting state.

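A run of $\mathcal{A}$ from $q \in Q$ on an input word $\alpha$ is an infinite sequence $\rho : \omega \to Q$ that
starts in $q$ and respects $\delta$, i.e., $\rho_0 = q$ and $\forall i \ge 0 : \rho_i \xrightarrow{\alpha_i} \rho_{i+1} \in \delta$. Let $\inf_Q(\rho)$ denote
the states occurring in $\rho$ infinitely often and $\inf_\delta(\rho)$ denote the transitions occurring in $\rho$
infinitely often. The run $\rho$ is called accepting iff $\inf_Q(\rho) \cap Q_F \ne \emptyset$ or $\inf_\delta(\rho) \cap \delta_F \ne \emptyset$.

A word $\alpha$ is accepted by $\mathcal{A}$ from a state $q \in Q$ if there is an accepting run $\rho$ of $\mathcal{A}$
from $q$, i.e., $\rho_0 = q$. The set $\mathcal{L}_\mathcal{A}(q) = \{\alpha \in \Sigma^\omega \mid \mathcal{A} \text{ accepts } \alpha \text{ from } q\}$ is called the
language of $q$ (in $\mathcal{A}$). Given a set of states $R \subseteq Q$, we define the language of $R$ as
$\mathcal{L}_\mathcal{A}(R) = \bigcup_{q \in R} \mathcal{L}_\mathcal{A}(q)$ and the language of $\mathcal{A}$ as $\mathcal{L}(\mathcal{A}) = \mathcal{L}_\mathcal{A}(I)$. We say that a state
$q \in Q$ is useless iff $\mathcal{L}_\mathcal{A}(q) = \emptyset$. If $\delta_F = \emptyset$, we call $\mathcal{A}$ state-based and if $Q_F = \emptyset$, we
call $\mathcal{A}$ transition-based. In this paper, we fix a BA $\mathcal{A} = (Q, \delta, I, Q_F \cup \delta_F)$.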


3 Complementing Büchi automata 


In this section, we describe a generalization of the rank-based complementation of state- 
based BAs presented by Schewe in [37] to our notion of transition/state-based BAs. 
Proofs can be found in [16]. 


3.1 Run DAGs 


First, we recall the terminology from [37] (which is a minor modification of the one
in [24]), which we use in the paper. Let the run DAG of $\mathcal{A}$ over a word $\alpha$ be a DAG
(directed acyclic graph) $\mathcal{G}_\alpha = (V, E)$ containing vertices $V$ and edges $E$ such that


- $V \subseteq Q \times \omega$ s.t. $(q, i) \in V$ iff there is a run $\rho$ of $\mathcal{A}$ from $I$ over $\alpha$ with $\rho_i = q$,
- $E \subseteq V \times V$ s.t. $((q, i), (q', i')) \in E$ iff $i' = i + 1$ and $q' \in \delta(q, \alpha_i)$.


Given $\mathcal{G}_\alpha$ as above, we will write $(p, i) \in \mathcal{G}_\alpha$ to denote that $(p, i) \in V$. A vertex
$(p, i) \in V$ is called accepting if $p$ is an accepting state and an edge $((q, i), (q', i')) \in E$
is called accepting if $q \xrightarrow{\alpha_i} q'$ is an accepting transition. A vertex $v \in \mathcal{G}_\alpha$ is finite if the
set of vertices reachable from $v$ is finite, infinite if it is not finite, and endangered if it
cannot reach an accepting vertex or an accepting edge.

We assign ranks to vertices of run DAGs as follows: Let $\mathcal{G}^0_\alpha = \mathcal{G}_\alpha$ and $j = 0$. Repeat
the following steps until the fixpoint or for at most $2n + 1$ steps, where $n = |Q|$.
- Set $rank_\alpha(v) \leftarrow j$ for all finite vertices $v$ of $\mathcal{G}^j_\alpha$ and let $\mathcal{G}^{j+1}_\alpha$ be $\mathcal{G}^j_\alpha$ minus the
vertices with the rank $j$.
- Set $rank_\alpha(v) \leftarrow j + 1$ for all endangered vertices $v$ of $\mathcal{G}^{j+1}_\alpha$ and let $\mathcal{G}^{j+2}_\alpha$ be $\mathcal{G}^{j+1}_\alpha$
minus the vertices with the rank $j + 1$.
- Set $j \leftarrow j + 2$.


For all vertices $v$ that have not been assigned a rank yet, we assign $rank_\alpha(v) \leftarrow \omega$.
We define the rank of $\alpha$, denoted as $rank(\alpha)$, as $\max\{rank_\alpha(v) \mid v \in \mathcal{G}_\alpha\}$ and the
rank of $\mathcal{A}$, denoted as $rank(\mathcal{A})$, as $\max\{rank(w) \mid w \in \Sigma^\omega \setminus \mathcal{L}(\mathcal{A})\}$.


Lemma 1. If $\alpha \notin \mathcal{L}(\mathcal{A})$, then $rank(\alpha) \le 2|Q|$.
3.2 Rank-Based Complementation


In this section, we describe a construction for complementing BAs developed in the work
of Kupferman and Vardi [24] (later improved by Friedgut, Kupferman, and Vardi [14],
and by Schewe [37]), extended to our definition of BAs with accepting states and
transitions (see [19] for a step-by-step introduction). The construction is based on the notion
of tight level rankings storing information about levels in run DAGs. For a BA $\mathcal{A}$ and
$n = |Q|$, a (level) ranking is a function $f : Q \to [2n]$ such that $f(Q_F) \subseteq \{0, 2, \dots, 2n\}$,
i.e., $f$ assigns even ranks to accepting states of $\mathcal{A}$. For two rankings $f$ and $f'$ we define
$f \lhd^a_S f'$ iff for each $q \in S$ and $q' \in \delta(q, a)$ we have $f'(q') \le f(q)$ and for each
$q'' \in \delta_F(q, a)$ it holds $f'(q'') \le \lfloor\!\lfloor f(q) \rfloor\!\rfloor$. The set of all rankings is denoted by $\mathcal{R}$. For
a ranking $f$, the rank of $f$ is defined as $rank(f) = \max\{f(q) \mid q \in Q\}$. We use $f \le f'$
iff for every state $q \in Q$ we have $f(q) \le f'(q)$ and we use $f < f'$ iff $f \le f'$ and there
is a state $q \in Q$ with $f(q) < f'(q)$. For a set of states $S \subseteq Q$, we call $f$ to be S-tight if
(i) it has an odd rank $r$, (ii) $f(S) \supseteq \{1, 3, \dots, r\}$, and (iii) $f(Q \setminus S) = \{0\}$. A ranking is
tight if it is Q-tight; we use $\mathcal{T}$ to denote the set of all tight rankings.
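
The ranking and tightness conditions above translate directly into a few predicates. The following is a small sketch for experimenting with them, not code from RANKER; rankings are encoded as dictionaries over all states, and the function names and the three-state example are assumptions.

```python
def even_floor(i):
    """Largest even number smaller than or equal to i."""
    return i - (i % 2)

def is_ranking(f, states, accepting):
    """f maps every state to [2n] and gives even ranks to accepting states."""
    n = len(states)
    return (all(0 <= f[q] <= 2 * n for q in states)
            and all(f[q] % 2 == 0 for q in accepting))

def rank(f):
    return max(f.values())

def is_s_tight(f, states, s):
    """Conditions (i)-(iii) of S-tightness for the set s of states."""
    r = rank(f)
    if r % 2 == 0:                      # (i) the rank must be odd
        return False
    covered = {f[q] for q in s}
    return (set(range(1, r + 1, 2)) <= covered       # (ii) all odd ranks up to r occur in s
            and all(f[q] == 0 for q in states - s))  # (iii) rank 0 outside s

# Hypothetical automaton with states r, s, t where only s is accepting.
Q, QF = {"r", "s", "t"}, {"s"}
f1 = {"r": 1, "s": 0, "t": 1}
f2 = {"r": 3, "s": 0, "t": 3}
print(is_s_tight(f1, Q, Q), is_s_tight(f2, Q, Q))
```

Here `f1` is tight, while `f2` is a valid ranking that is not tight because the odd rank 1 is missing.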

The original rank-based construction [24] uses macrostates of the form $(S, O, f)$ to
track all runs of $\mathcal{A}$ over $\alpha$. The $f$-component contains guesses of the ranks of states
in $S$ (which is obtained by the classical subset construction) in the run DAG and the
$O$-set is used to check whether all runs contain only a finite number of accepting states.
Friedgut, Kupferman, and Vardi [14] improved the construction by having $f$ consider
only tight rankings. Schewe’s construction [37] extends the macrostates to $(S, O, f, i)$
with $i \in \omega$ representing a particular even rank such that $O$ tracks states with rank $i$.
At the cut-point (a macrostate with $O = \emptyset$) the value of $i$ is changed to $i + 2$ modulo the
rank of $f$. Macrostates in an accepting run hence iterate over all possible values of $i$.
Formally, the complement of $\mathcal{A} = (Q, \delta, I, Q_F \cup \delta_F)$ is given as the (state-based) BA
SCHEWE($\mathcal{A}$) $= (Q', \delta', I', Q'_F \cup \emptyset)$, whose components are defined as follows:


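- $Q' = Q_1 \cup Q_2$ where
  • $Q_1 = 2^Q$ and
  • $Q_2 = \{(S, O, f, i) \in 2^Q \times 2^Q \times \mathcal{T} \times \{0, 2, \dots, 2n - 2\} \mid f \text{ is } S\text{-tight},\ O \subseteq S \cap f^{-1}(i)\}$,
- $I' = \{I\}$,
- $\delta' = \delta_1 \cup \delta_2 \cup \delta_3$ where
  • $\delta_1 : Q_1 \times \Sigma \to 2^{Q_1}$ such that $\delta_1(S, a) = \{\delta(S, a)\}$,
  • $\delta_2 : Q_1 \times \Sigma \to 2^{Q_2}$ such that $\delta_2(S, a) = \{(S', \emptyset, f, 0) \mid S' = \delta(S, a),\ f \text{ is } S'\text{-tight}\}$, and
  • $\delta_3 : Q_2 \times \Sigma \to 2^{Q_2}$ such that $(S', O', f', i') \in \delta_3((S, O, f, i), a)$ iff
    * $S' = \delta(S, a)$,
    * $f \lhd^a_S f'$,
    * $rank(f) = rank(f')$,
    * and
      ◦ if $O = \emptyset$ then $i' = (i + 2) \bmod (rank(f') + 1)$ and $O' = f'^{-1}(i')$, and
      ◦ if $O \ne \emptyset$ then $i' = i$ and $O' = \delta(O, a) \cap f'^{-1}(i)$; and
- $Q'_F = \{\emptyset\} \cup ((2^Q \times \{\emptyset\} \times \mathcal{T} \times \omega) \cap Q_2)$.


We call the part of the automaton with states from $Q_1$ the waiting part (denoted as
WAITING), and the part corresponding to $Q_2$ the tight part (denoted as TIGHT).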
Theorem 2. Let $\mathcal{A}$ be a BA. Then $\mathcal{L}(\text{SCHEWE}(\mathcal{A})) = \Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$.


The space complexity of Schewe's construction for BAs matches the theoretical
lower bound $\mathcal{O}((0.76n)^n)$ given by Yan [43] modulo a quadratic factor $\mathcal{O}(n^2)$. Note that
our extension to BAs with accepting transitions does not increase the space complexity
of the construction.


Example 3. Consider the BA $\mathcal{A}$ over $\{a, b\}$ given in Fig. 1a. A part of
SCHEWE($\mathcal{A}$) is shown in Fig. 1b (we use $(\{s{:}0, t{:}1\}, \emptyset)$ to denote the macrostate
$(\{s, t\}, \emptyset, \{s \mapsto 0, t \mapsto 1\}, 0)$). We omit the $i$-part of each macrostate since
the corresponding values are 0 for all macrostates in the figure. Useless states
are covered by grey stripes. The full automaton contains even more transitions
from $\{r\}$ to useless macrostates of the form $(\{r{:}\cdot, s{:}\cdot, t{:}\cdot\}, \emptyset)$.

[Figure 1 drawings omitted. Panel (a): BA $\mathcal{A}$ over $\{a, b\}$. Panel (b): a part of SCHEWE($\mathcal{A}$).]

Fig. 1: Schewe's complementation

From the construction of SCHEWE($\mathcal{A}$), we can see that the number of states is
affected mainly by sizes of macrostates and by the maximum rank of $\mathcal{A}$. In particular,
the upper bound on the number of states of the complement with the maximum rank $r$
is given in the following lemma.


Lemma 4. For a BA $\mathcal{A}$ with sufficiently many states $n$ such that $rank(\mathcal{A}) = r$ the
number of states of the complemented automaton is bounded by
$2^n + \frac{(r+m)^n}{(r+m)!}$, where $m = \max\{0, 3 - \lceil \frac{r}{2} \rceil\}$.
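
To get a feeling for how strongly the maximum rank influences this bound, it can be evaluated numerically. The sketch assumes that the bound of Lemma 4 reads $2^n + (r+m)^n/(r+m)!$ with $m = \max\{0, 3 - \lceil r/2 \rceil\}$; the function name is hypothetical, and the lemma only claims the bound for sufficiently large n.

```python
from math import factorial

def complement_size_bound(n, r):
    """Evaluate 2^n + (r+m)^n / (r+m)! with m = max(0, 3 - ceil(r/2))."""
    m = max(0, 3 - -(-r // 2))   # -(-r // 2) is the integer ceiling of r/2
    return 2 ** n + (r + m) ** n / factorial(r + m)

# For n = 20 states, compare a small rank with a larger one.
low = complement_size_bound(20, 3)
high = complement_size_bound(20, 7)
print(f"{low:.3e} {high:.3e}")
```

With these inputs the bound for rank 7 is several hundred times larger than the one for rank 3, which is the effect that motivates lowering the rank bound.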


From Lemma 1 we have that the rank of $\mathcal{A}$ is bounded by $2|Q|$. Such a bound
is often too coarse and hence SCHEWE($\mathcal{A}$) may contain many redundant states.
Decreasing the bound on the ranks is essential for a practical algorithm, but an optimal
solution is PSPACE-complete [15]. The rest of this paper therefore proposes a framework
of lightweight techniques for decreasing the maximum rank bound and, in this way,
significantly reducing the size of the complemented BA.


3.3 Tight Rank Upper Bounds 
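Let $\alpha \notin \mathcal{L}(\mathcal{A})$. For $\ell \in \omega$, we define the $\ell$-th level of $\mathcal{G}_\alpha$ as
$level_\alpha(\ell) = \{q \mid (q, \ell) \in \mathcal{G}_\alpha\}$. Furthermore, we use $f^\alpha_\ell$ to denote the ranking
of level $\ell$ of $\mathcal{G}_\alpha$. Formally,

$$f^\alpha_\ell(q) = \begin{cases} rank_\alpha((q, \ell)) & \text{if } q \in level_\alpha(\ell), \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

We say that the $\ell$-th level of $\mathcal{G}_\alpha$ is tight if for all $k \ge \ell$ it holds that (i) $f^\alpha_k$ is tight, and
(ii) $rank(f^\alpha_k) = rank(f^\alpha_\ell)$. Let $\rho = S_0 S_1 \dots S_{\ell-1} (S_\ell, O_\ell, f_\ell, i_\ell) \dots$ be a run on a word
$\alpha$ in SCHEWE($\mathcal{A}$). We say that $\rho$ is a super-tight run [19] if $f_k = f^\alpha_k$ for each $k \ge \ell$.
Finally, we say that a mapping $\mu : 2^Q \to \mathcal{R}$ is a tight rank upper bound (TRUB) wrt $\alpha$ iff

$$\exists \ell \in \omega : level_\alpha(\ell) \text{ is tight} \land (\forall k \ge \ell : \mu(level_\alpha(k)) \ge f^\alpha_k). \tag{2}$$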


Informally, a TRUB is a ranking that gives a conservative (i.e., larger) estimate on 
the necessary ranks of states in a super-tight run. We say that u is a TRUB iff u 
is a TRUB wrt all a ¢ L(A). We abuse notation and use the term TRUB also for 
a mapping u’: 22 — w if the mapping inner(u’) is a TRUB where inner(u')(S) = 
{q= m | m= u'(S) - lif q € Qr else m = u'(S)) for all S € 22. (= is the monus 
operator, i.e., minus with negative results saturated to zero.) Note that the mappings 
Hu = (S e (2|S \ Or| > 1)! scoo and inner(p;) are trivial TRUBs. 
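The trivial TRUB and the inner mapping are simple enough to write down directly. The sketch below assumes macrostates are Python sets of state names and that `accepting` plays the role of $Q_F$; the function names are illustrative and not taken from any tool.

```python
def monus(x: int, y: int) -> int:
    """Subtraction saturated at zero (the monus operator)."""
    return max(0, x - y)


def trivial_trub(S, accepting) -> int:
    """Rank bound 2|S \\ Q_F| monus 1 for the macrostate S."""
    return monus(2 * len(set(S) - set(accepting)), 1)


def inner(rank_bound: int, S, accepting) -> dict:
    """Turn a rank bound for S into a ranking: accepting states get
    the bound lowered by one (saturated at zero), others get the bound."""
    return {q: monus(rank_bound, 1) if q in accepting else rank_bound
            for q in S}
```

With no accepting states, the macrostates of sizes 1, 2, and 4 receive the bounds 1, 3, and 7, respectively.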

The following lemma shows that we can remove from SCHEWE($\mathcal{A}$) macrostates whose ranking is not covered by a TRUB (in particular, we show that the reduced automaton preserves super-tight runs).

Lemma 5. Let $\mu$ be a TRUB and $\mathcal{B}$ be a BA obtained from SCHEWE($\mathcal{A}$) by replacing all occurrences of $Q_2$ by $Q_2' = \{(S, O, f, i) \mid f \leq \mu(S)\}$. Then, $\mathcal{L}(\mathcal{B}) = \Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$.


4 Elevator Automata 


In this section, we introduce elevator automata, which are BAs having a particular structure that can be exploited for complementation and semi-determinization; elevator automata can be complemented in $O(16^n)$ (cf. Lemma 10) space instead of $2^{O(n \log n)}$, which is the lower bound for unrestricted BAs, and semi-determinized in $O(2^n)$ instead of $O(4^n)$ (cf. [16]). The class of elevator automata is quite general: it can be seen as a substantial generalization of semi-deterministic BAs (SDBAs) [11,5]. Intuitively, an elevator automaton is a BA whose strongly connected components are all either deterministic or inherently weak.

Let $\mathcal{A} = (Q, \delta, I, Q_F \cup \delta_F)$. $C \subseteq Q$ is a strongly connected component (SCC) of $\mathcal{A}$ if for any pair of states $q, q' \in C$ it holds that $q$ is reachable from $q'$ and $q'$ is reachable from $q$. $C$ is maximal (MSCC) if it is not a proper subset of another SCC. An MSCC $C$ is trivial iff $|C| = 1$ and $\delta|_C = \emptyset$. The condensation of $\mathcal{A}$ is the DAG $\mathrm{cond}(\mathcal{A}) = (M, E)$ where $M$ is the set of $\mathcal{A}$'s MSCCs and $E = \{(C_1, C_2) \mid \exists q_1 \in C_1, \exists q_2 \in C_2, \exists a \in \Sigma\colon q_1 \xrightarrow{a} q_2 \in \delta\}$. An MSCC is non-accepting if it contains no accepting state and no accepting transition, i.e., $C \cap Q_F = \emptyset$ and $\delta|_C \cap \delta_F = \emptyset$. The depth of $(M, E)$ is defined as the number of MSCCs on the longest path in $(M, E)$.
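The notions of MSCC, condensation, and depth can be made concrete with a few lines of code. This is a minimal sketch under two assumptions: the transition structure is given as a successor map `succ` from states to sets of states (letters are irrelevant here), and edges from an MSCC to itself are left out of the edge set so that the result is acyclic. It uses a quadratic reachability computation for clarity; a practical implementation would rather use Tarjan's algorithm.

```python
from functools import lru_cache


def reachable(succ, q):
    """States reachable from q (q itself included)."""
    seen, stack = {q}, [q]
    while stack:
        for t in succ.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen


def msccs(states, succ):
    """Maximal SCCs: p and q share a component iff each reaches the other."""
    reach = {q: reachable(succ, q) for q in states}
    return {frozenset(p for p in reach[q] if q in reach[p]) for q in states}


def condensation(states, succ):
    """Return (M, E): the MSCCs and the edges between distinct MSCCs."""
    comps = msccs(states, succ)
    comp_of = {q: c for c in comps for q in c}
    edges = {(comp_of[q], comp_of[t])
             for q in states for t in succ.get(q, ())
             if comp_of[q] != comp_of[t]}
    return comps, edges


def depth(comps, edges):
    """Number of MSCCs on the longest path of the condensation."""
    children = {c: [d for (x, d) in edges if x == c] for c in comps}

    @lru_cache(maxsize=None)
    def longest(c):
        return 1 + max((longest(d) for d in children[c]), default=0)

    return max((longest(c) for c in comps), default=0)
```

For a hypothetical automaton with a self-loop on `r`, a transition from `r` to `s`, and a cycle between `s` and `t`, the condensation has the two MSCCs `{r}` and `{s, t}` and depth 2.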

We say that an SCC $C$ is inherently weak accepting (IWA) iff every cycle in the transition diagram of $\mathcal{A}$ restricted to $C$ contains an accepting state or an accepting transition. $C$ is inherently weak if it is either non-accepting or IWA, and $\mathcal{A}$ is inherently weak if all of its MSCCs are inherently weak. $\mathcal{A}$ is deterministic iff $|I| \leq 1$ and $|\delta(q,a)| \leq 1$ for all $q \in Q$ and $a \in \Sigma$. An SCC $C \subseteq Q$ is deterministic iff $(C, \delta|_C, \emptyset, \emptyset)$ is deterministic. $\mathcal{A}$ is a semi-deterministic BA (SDBA) if $\mathcal{A}[q]$ is deterministic for every $q \in Q_F \cup \{p \in Q \mid s \xrightarrow{a} p \in \delta_F, s \in Q, a \in \Sigma\}$, i.e., whenever a run in $\mathcal{A}$ reaches an accepting state or an accepting transition, it can only continue deterministically.

$\mathcal{A}$ is an elevator (Büchi) automaton iff for every MSCC $C$ of $\mathcal{A}$ it holds that $C$ is (i) deterministic, (ii) IWA, or (iii) non-accepting. In other words, a BA is an elevator automaton iff every nondeterministic SCC of $\mathcal{A}$ that contains an accepting state or transition is inherently weak. An example of an elevator automaton obtained from the LTL formula $\mathsf{GF}(a \lor \mathsf{GF}(b \lor \mathsf{GF}c))$ is shown in Fig. 2. The BA consists of three connected deterministic components. Note that the automaton is neither semi-deterministic nor unambiguous.

Fig. 2: The BA for LTL formula $\mathsf{GF}(a \lor \mathsf{GF}(b \lor \mathsf{GF}c))$ is elevator.

The rank of an elevator automaton $\mathcal{A}$ does not depend on the number of states (as in general BAs), but only on the number of MSCCs and the depth of $\mathrm{cond}(\mathcal{A})$. In the worst case, $\mathcal{A}$ consists of a chain of deterministic components, yielding the upper bound on the rank of elevator automata given in the following lemma.

Lemma 6. Let $\mathcal{A}$ be an elevator automaton such that its condensation has the depth $d$. Then $\mathrm{rank}(\mathcal{A}) \leq 2d$.
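The three conditions in the definition of elevator automata can be checked per component. The sketch below assumes transitions are triples `(q, a, t)`, that `acc_states` and `acc_trans` stand for $Q_F$ and $\delta_F$, and that the IWA test is done by deleting accepting states and accepting transitions from the component and checking that no cycle remains (which is equivalent to every cycle containing an accepting state or transition). All names are illustrative. Note that a component without any cycle passes the IWA test vacuously.

```python
def restrict(trans, C):
    """Transitions that stay inside the component C."""
    return {(q, a, t) for (q, a, t) in trans if q in C and t in C}


def is_deterministic(C, trans):
    """At most one successor inside C for every state and letter."""
    seen = set()
    for (q, a, _t) in restrict(trans, C):
        if (q, a) in seen:
            return False
        seen.add((q, a))
    return True


def is_non_accepting(C, trans, acc_states, acc_trans):
    return not (set(C) & set(acc_states)) and not (restrict(trans, C) & set(acc_trans))


def has_cycle(nodes, edges):
    """Kahn-style check: repeatedly remove nodes without incoming edges."""
    succ = {n: set() for n in nodes}
    indeg = {n: 0 for n in nodes}
    for (q, t) in edges:
        if t not in succ[q]:
            succ[q].add(t)
            indeg[t] += 1
    queue = [n for n in nodes if indeg[n] == 0]
    removed = 0
    while queue:
        n = queue.pop()
        removed += 1
        for t in succ[n]:
            indeg[t] -= 1
            if indeg[t] == 0:
                queue.append(t)
    return removed < len(nodes)


def is_iwa(C, trans, acc_states, acc_trans):
    """Every cycle inside C hits an accepting state or transition."""
    rejecting = set(C) - set(acc_states)
    edges = {(q, t) for (q, a, t) in restrict(trans, C) - set(acc_trans)
             if q in rejecting and t in rejecting}
    return not has_cycle(rejecting, edges)


def is_elevator_component(C, trans, acc_states, acc_trans):
    return (is_deterministic(C, trans)
            or is_iwa(C, trans, acc_states, acc_trans)
            or is_non_accepting(C, trans, acc_states, acc_trans))
```

An automaton is then an elevator automaton iff `is_elevator_component` holds for each of its MSCCs.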


4.1 Refined Ranks for Elevator Automata 


Notice that the upper bound on ranks provided by Lemma 6 can still be too coarse. For instance, for an SDBA with three linearly ordered MSCCs such that the first two MSCCs are non-accepting and the last one is deterministic accepting, the lemma gives us the upper bound 6 on the rank, while it is known that every SDBA has rank at most 3 (cf. [5]). Another example is two deterministic non-trivial MSCCs connected by a path of trivial MSCCs, which can all be assigned the same rank.

Instead of refining the definition of elevator automata into some quite complex list of constraints, we rather provide an algorithm that performs a traversal through $\mathrm{cond}(\mathcal{A})$ and assigns each MSCC a label of the form type:rank that contains (i) a type and (ii) a bound on the maximum rank of states in the component. The types of MSCCs that we consider are the following:

T: trivial components,
IWA: inherently weak accepting components,
D: deterministic (potentially accepting) components, and
N: non-accepting components.

Note that the type of an MSCC is not given a priori but is determined by the algorithm (this is because for deterministic non-accepting components, it is sometimes better to treat them as D and sometimes as N, depending on their neighbourhood). In the following, we assume that $\mathcal{A}$ is an elevator automaton without useless states and, moreover, all accepting conditions on states and transitions not inside non-trivial MSCCs are removed (any BA can be easily transformed into this form).

We start with terminal MSCCs $C$, i.e., MSCCs that cannot reach any other MSCC:

T1: If $C$ is IWA, then we label it with IWA:0.

T2: Else if $C$ is deterministic accepting, we label it with D:2.

(Note that the previous two options are complete due to our requirements on the structure of $\mathcal{A}$.) When all terminal MSCCs are labelled, we proceed through $\mathrm{cond}(\mathcal{A})$, inductively on its structure, and label non-terminal components $C$ based on the rules defined below.

Fig. 3: Rules for assigning types and rank bounds to MSCCs. In each rule, the children of $C$ are the aggregate nodes D:$\ell_D$, N:$\ell_N$, and IWA:$\ell_W$.
(a) $C$ is IWA: $C$ is labelled IWA:$\ell$ with $\ell = \max\{\ell_D, \ell_N + 1, \ell_W\}$.
(b) $C$ is D: $C$ is labelled D:$\ell$ with $\ell = \max\{\ell_D + \oplus, \ell_N + 1, \ell_W + \otimes, 2\}$.
(c) $C$ is N: $C$ is labelled N:$\ell$ with $\ell = \max\{\ell_D + 1, \ell_N, \ell_W + 1\}$.
The symbols $\oplus$ and $\otimes$ are interpreted as 0 if all the corresponding edges from the components having rank $\ell_D$ and $\ell_W$, respectively, are deterministic; otherwise they are interpreted as 2. Transitions between two components $C_1$ and $C_2$ are deterministic if the BA $(C, \delta|_C, \emptyset, \emptyset)$ is deterministic for $C = \delta(C_1, \Sigma) \cap (C_1 \cup C_2)$.

Fig. 4: Structure of elevator ranking rules: a node $C$ with the aggregate children D:$\ell_D$, N:$\ell_N$, and IWA:$\ell_W$.

The rules are of the form that uses the structure depicted in Fig. 4, where children nodes denote already processed MSCCs. In particular, a child node of the form $k$:$\ell_k$ denotes an aggregate node of all siblings of the type $k$ with $\ell_k$ being the maximum rank of these siblings. Moreover, we use $\mathrm{typemax}\{e_D, e_N, e_W\}$ to denote the type $j \in \{\mathrm{D}, \mathrm{N}, \mathrm{IWA}\}$ for which $e_j = \max\{e_D, e_N, e_W\}$ where $e_i$ is an expression containing $\ell_i$ (if there are more such types, $j$ is chosen arbitrarily). The rules for assigning a type $t$ and a rank $\ell$ to $C$ are the following:


I1: If $C$ is trivial, we set $t = \mathrm{typemax}\{\ell_D, \ell_N, \ell_W\}$ and $\ell = \max\{\ell_D, \ell_N, \ell_W\}$.

I2: Else if $C$ is IWA, we use the rule in Fig. 3a.

I3: Else if $C$ is deterministic accepting, we use the rule in Fig. 3b.

I4: Else if $C$ is deterministic and non-accepting, we try both rules from Figs. 3b and 3c and pick the rule that gives us a smaller rank.

I5: Else if $C$ is nondeterministic and non-accepting, we use the rule in Fig. 3c.
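The labelling rules T1, T2, and I1 to I5 can be sketched as a single function applied bottom-up over the condensation. The sketch rests on several assumptions that are not fixed by the text: the rules of Fig. 3 are taken to be $\max\{\ell_D, \ell_N + 1, \ell_W\}$ for IWA, $\max\{\ell_D + x, \ell_N + 1, \ell_W + y, 2\}$ for D (with $x, y \in \{0, 2\}$ depending on whether the connecting edges are deterministic), and $\max\{\ell_D + 1, \ell_N, \ell_W + 1\}$ for N; terms for child types that do not occur are simply dropped; and the caller has already classified the component into one of the hypothetical kinds `"T"`, `"IWA"`, `"DA"` (deterministic accepting), `"DN"` (deterministic non-accepting), or `"N"` (nondeterministic non-accepting).

```python
def aggregate(children):
    """Maximum rank per child type; children is a list of (type, rank)."""
    agg = {}
    for typ, rank in children:
        agg[typ] = max(rank, agg.get(typ, rank))
    return agg


def _max_present(terms):
    return max(t for t in terms if t is not None)


def _plus(x, k):
    return None if x is None else x + k


def label(kind, children, d_edges_det=True, w_edges_det=True):
    """Return the (type, rank) label of a component of the given kind."""
    if not children:                       # terminal MSCC: rules T1 and T2
        if kind == "IWA":
            return ("IWA", 0)
        if kind == "DA":
            return ("D", 2)
        raise ValueError("terminal MSCC must be IWA or deterministic accepting")
    agg = aggregate(children)
    d, n, w = agg.get("D"), agg.get("N"), agg.get("IWA")
    iwa = ("IWA", _max_present([d, _plus(n, 1), w]))
    det = ("D", _max_present([_plus(d, 0 if d_edges_det else 2), _plus(n, 1),
                              _plus(w, 0 if w_edges_det else 2), 2]))
    non = ("N", _max_present([_plus(d, 1), n, _plus(w, 1)]))
    if kind == "T":                        # I1: inherit the maximal child
        typ = max(agg, key=agg.get)
        return (typ, agg[typ])
    if kind == "IWA":                      # I2
        return iwa
    if kind == "DA":                       # I3
        return det
    if kind == "DN":                       # I4: the smaller of both options
        return min(det, non, key=lambda lab: lab[1])
    if kind == "N":                        # I5
        return non
    raise ValueError(f"unknown kind {kind!r}")
```

For a deterministic non-accepting component above a terminal IWA component, the sketch picks N with rank 1, and for the SDBA chain mentioned at the start of this section (two non-accepting components above a deterministic accepting one) it yields rank 3 rather than 6.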


Then, for every MSCC $C$ of $\mathcal{A}$, we assign each of its states the rank of $C$. We use $\chi\colon Q \to \omega$ to denote the rank bounds computed by the procedure above.

Lemma 7. $\chi$ is a TRUB.

Using Lemma 5, we can now use $\chi$ to prune states during the construction of SCHEWE($\mathcal{A}$), as shown in the following example.

Fig. 5: A part of SCHEWE($\mathcal{A}$). The TRUB computed by elevator rules is used to prune states outside the yellow area.

Example 8. As an example, consider the BA $\mathcal{A}$ in Fig. 1a. The set of MSCCs with their types is given as $\{\{r\}{:}\mathrm{N}, \{s, t\}{:}\mathrm{IWA}\}$ showing that BA $\mathcal{A}$ is an elevator. Using the rules T1 and I4 we get the TRUB $\chi = \{r{:}1, s{:}0, t{:}0\}$. The TRUB can be used to prune the generated states as shown in Fig. 5. □


4.2 Efficient Complementation of Elevator Automata 


In Section 4.1 we proposed an algorithm for assigning ranks to MSCCs of an elevator 
automaton A. The drawback of the algorithm is that the maximum obtained rank is not 
bounded by a constant but by the depth of the condensation of A. We will, however, 
show that it is actually possible to change A by at most doubling the number of states 
and obtain an elevator BA with the rank at most 3. 

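Intuitively, the construction copies every non-trivial MSCC $C$ with an accepting state or transition into a component $C^\bullet$, copies all transitions going into states in $C$ to also go into the corresponding states in $C^\bullet$, and, finally, removes all accepting conditions from $C$. Formally, let $\mathcal{A} = (Q, \delta, I, Q_F \cup \delta_F)$ be a BA. For $C \subseteq Q$, we use $C^\bullet$ to denote a unique copy of $C$, i.e., $C^\bullet = \{q^\bullet \mid q \in C\}$ s.t. $C^\bullet \cap Q = \emptyset$. Let $M$ be the set of MSCCs of $\mathcal{A}$. Then, the deelevated BA DEELEV($\mathcal{A}$) $= (Q', \delta', I', Q'_F \cup \delta'_F)$ is given as follows:

- $Q' = Q \cup Q^\bullet$,
- $\delta'\colon Q' \times \Sigma \to 2^{Q'}$ where for $q \in Q$
  - $\delta'(q, a) = \delta(q, a) \cup (\delta(q, a))^\bullet$ and
  - $\delta'(q^\bullet, a) = (\delta(q, a) \cap C)^\bullet$ for $q \in C \in M$;
- $I' = I$, and
- $Q'_F = Q_F^\bullet$ and $\delta'_F = \{q^\bullet \xrightarrow{a} r^\bullet \mid q \xrightarrow{a} r \in \delta_F\} \cap \delta'$.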
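The deelevation construction translates almost literally into code. The sketch below assumes transitions are triples `(q, a, t)`, that `comp_of` maps each state to an identifier of its MSCC, and that copies are represented as pairs `(q, "copy")` which do not clash with original state names; it also assumes, following the prose description, that the accepting states of the result are the copies of the original accepting states. It does not trim useless states.

```python
def deelevate(states, trans, init, acc_states, acc_trans, comp_of):
    """Return (Q', delta', I', Q'_F, delta'_F) of the deelevated automaton."""
    def copy(q):
        return (q, "copy")

    new_states = set(states) | {copy(q) for q in states}
    new_trans = set()
    for (q, a, t) in trans:
        new_trans.add((q, a, t))            # original transition
        new_trans.add((q, a, copy(t)))      # jump into the copied part
        if comp_of[q] == comp_of[t]:        # copies never leave their MSCC
            new_trans.add((copy(q), a, copy(t)))
    new_acc_states = {copy(q) for q in acc_states}
    new_acc_trans = {(copy(q), a, copy(t)) for (q, a, t) in acc_trans} & new_trans
    return new_states, new_trans, set(init), new_acc_states, new_acc_trans
```

The result has exactly twice as many states as the input, and all accepting states and transitions live in the copied part.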


It is easy to see that the number of states of the deelevated automaton is bounded by $2|Q|$. Moreover, if $\mathcal{A}$ is elevator, so is DEELEV($\mathcal{A}$). The construction preserves the language of $\mathcal{A}$, as shown by the following lemma.

Lemma 9. Let $\mathcal{A}$ be a BA. Then, $\mathcal{L}(\mathcal{A}) = \mathcal{L}(\text{DEELEV}(\mathcal{A}))$.

Moreover, for an elevator automaton $\mathcal{A}$, the structure of DEELEV($\mathcal{A}$) consists of (after trimming useless states) several non-accepting MSCCs with copied terminal deterministic or IWA MSCCs. Therefore, if we apply the algorithm from Section 4.1 on DEELEV($\mathcal{A}$), we get that its rank is bounded by 3, which gives the following upper bound for complementation of elevator automata.

Lemma 10. Let $\mathcal{A}$ be an elevator automaton with sufficiently many states $n$. Then the language $\Sigma^\omega \setminus \mathcal{L}(\mathcal{A})$ can be represented by a BA with at most $O(16^n)$ states.


The complementation through DEELEV($\mathcal{A}$) gives a better upper bound than the rank refinement from Section 4.1 applied directly on $\mathcal{A}$. Based on our experience, however, complementation through DEELEV($\mathcal{A}$) behaves worse in many real-world instances. This poor behaviour is caused by the fact that the complement of DEELEV($\mathcal{A}$) can have a larger WAITING and macrostates in TIGHT can have larger $S$-components, which can yield more generated states (despite the rank bound 3). It seems that the most promising approach would be a combination of the approaches, which we leave for future work.

Fig. 6: Rules assigning types and rank bounds for non-elevator automata. In each rule, the children of $C$ are the aggregate nodes D:$\ell_D$, N:$\ell_N$, IWA:$\ell_W$, and G:$\ell_G$.
(a) $C$ is IWA: $\ell = \max\{\ell_D, \ell_N + 1, \ell_W, \ell_G\}$.
(b) $C$ is D: $\ell = \max\{\ell_D + \oplus, \ell_N + 1, \ell_W + \otimes, \ell_G + 2, 2\}$, where the two marker symbols $\oplus$ and $\otimes$ are interpreted as in Fig. 3.
(c) $C$ is N: $\ell = \max\{\ell_D + 1, \ell_N, \ell_W + 1, \ell_G + 1\}$.


4.3 Refined Ranks for Non-Elevator Automata


The algorithm from Section 4.1 computing a TRUB for elevator automata can be 
extended to compute TRUBs even for general non-elevator automata (i.e., BAs with 
nondeterministic accepting components that are not inherently weak). To achieve this 
generalization, we extend the rules for assigning types and ranks to MSCCs of elevator 
automata from Section 4.1 to take into account general non-deterministic components. 
For this, we add into our collection of MSCC types general components (denoted as G). 
Further, we need to extend the rules for terminal components with the following rule: 


T3: Otherwise, we label $C$ with G:$2|C \setminus Q_F|$.

Moreover, we adjust the rules for assigning a type $t$ and a rank $\ell$ to $C$ to the following (the rule I1 is the same as for the case of elevator automata):

I2–I5: (We replace the corresponding rules for their counterparts including general components from Fig. 6).
I6: Otherwise, we use the rule in Fig. 7.

Fig. 7: $C$ is G. $C$ is labelled G:$\ell$ with $\ell = \max\{\ell_D, \ell_N + 1, \ell_W, \ell_G\} + 2|C \setminus Q_F|$; the children of $C$ are the aggregate nodes D:$\ell_D$, N:$\ell_N$, IWA:$\ell_W$, and G:$\ell_G$.

Then, for every MSCC $C$ of a BA $\mathcal{A}$, we assign each of its states the rank of $C$. Again, we use $\chi\colon Q \to \omega$ to denote the rank bounds computed by the adjusted procedure above.

Lemma 11. $\chi$ is a TRUB.


5 Rank Propagation 


In the previous section, we proposed a way to obtain a TRUB for elevator automata (with generalization to general automata). In this section, we propose a way of using the structure of $\mathcal{A}$ to refine a TRUB using a propagation of values and thus reduce the size of TIGHT. Our approach uses data flow analysis [32] to reason on how ranks and rankings of macrostates of SCHEWE($\mathcal{A}$) can be decreased based on the ranks and rankings of the local neighbourhood of the macrostates. We, in particular, use a special case of forward analysis working on the skeleton of SCHEWE($\mathcal{A}$), which is defined as the BA $K_\mathcal{A} = (2^Q, \delta', \emptyset, \emptyset)$ where $\delta' = \{R \xrightarrow{a} S \mid S = \delta(R, a)\}$ (note that we are only interested in the structure of $K_\mathcal{A}$ and not its language; also notice the similarity of $K_\mathcal{A}$ with WAITING). Our analysis refines a rank/ranking estimate $\mu(S)$ for a macrostate $S$ of $K_\mathcal{A}$ based on the estimates for its predecessors $R_1, \ldots, R_m$ (see Fig. 8). The new estimate is denoted as $\mu'(S)$.

Fig. 8: Rank propagation flow: the estimates $\mu(R_1), \mu(R_2), \ldots, \mu(R_m)$ of the predecessors flow into $S$ over the transitions labelled $a_1, a_2, \ldots, a_m$.

More precisely, $\mu\colon 2^Q \to \mathbb{V}$ is a function giving each macrostate of $K_\mathcal{A}$ a value from the domain $\mathbb{V}$. We will use the following two value domains: (i) $\mathbb{V} = \omega$, which is used for estimating ranks of macrostates (in the outer macrostate analysis) and (ii) $\mathbb{V} = \mathcal{R}$, which is used for estimating rankings within macrostates (in the inner macrostate analysis). For each of the analyses, we will give the update function $\mathit{up}\colon (2^Q \to \mathbb{V}) \times (2^Q)^{m+1} \to \mathbb{V}$, which defines how the value of $\mu(S)$ is updated based on the values of $\mu(R_1), \ldots, \mu(R_m)$. We then construct a system with the following equation for every $S \in 2^Q$:

$$\mu(S) = \mathit{up}(\mu, S, R_1, \ldots, R_m) \quad \text{where } \{R_1, \ldots, R_m\} = \delta'^{-1}(S, \Sigma). \tag{3}$$

We then solve the system of equations using standard algorithms for data flow analysis (see, e.g., [32, Chapter 2]) to obtain the fixpoint $\mu^*$. Our analyses have the important property that if they start with $\mu_0$ being a TRUB, then $\mu^*$ will also be a TRUB.

As the initial TRUB, we can use a trivial TRUB or any other TRUB (e.g., the output of elevator state analysis from Section 4).


5.1 Outer Macrostate Analysis 


We start with the simpler of the two analyses, the outer macrostate analysis, which only looks at sizes of macrostates. Recall that the rank $r$ of every super-tight run in SCHEWE($\mathcal{A}$) does not change, i.e., a super-tight run stays in WAITING as long as needed so that when it jumps to TIGHT, it takes the rank $r$ and never needs to decrease it. We can use this fact to decrease the maximum rank of a macrostate $S$ in $K_\mathcal{A}$. In particular, let us consider all cycles going through $S$. For each of the cycles $c$, we can bound the maximum rank of a super-tight run going through $c$ by $2m - 1$ where $m$ is the smallest number of non-accepting states occurring in any macrostate on $c$ (from the definition, the rank of a tight ranking does not depend on accepting states). Then we can infer that the maximum rank of any super-tight run going through $S$ is bounded by the maximum rank of any of the cycles going through $S$ (since $S$ can never assume a higher rank in any super-tight run). Moreover, the rank of each cycle can also be estimated in a more precise way, e.g., using our elevator analysis.

Since the number of cycles in $K_\mathcal{A}$ can be large², instead of their enumeration, we employ data flow analysis with the value domain $\mathbb{V} = \omega$ (i.e., for every macrostate $S$ of $K_\mathcal{A}$, we remember a bound on the maximum rank of $S$) and the following update function:

$$\mathit{up}_{out}(\mu, S, R_1, \ldots, R_m) = \min\{\mu(S), \max\{\mu(R_1), \ldots, \mu(R_m)\}\}. \tag{4}$$

Intuitively, the new bound on the maximum rank of $S$ is taken as the smaller of the previous bound $\mu(S)$ and the largest of the bounds of all predecessors of $S$, and the new value is propagated forward by the data flow analysis.
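The outer macrostate analysis can be sketched as a simple round-robin fixpoint iteration. The sketch assumes that the skeleton is given as a map `preds` from each macrostate to the list of its predecessors, that macrostates without predecessors keep their initial estimate, and that the predecessor structure used in the usage example below is read off the computations in Example 12 rather than from the figure itself; the function name is illustrative.

```python
def outer_analysis(mu0, preds):
    """Iterate up_out(mu, S, R_1..R_m) = min(mu(S), max(mu(R_i))) to a fixpoint.

    mu0   -- initial rank bound per macrostate (assumed to be a TRUB)
    preds -- predecessors of each macrostate in the skeleton
    """
    mu = dict(mu0)
    changed = True
    while changed:
        changed = False
        for S, rs in preds.items():
            if not rs:                      # no predecessors: nothing to refine
                continue
            new = min(mu[S], max(mu[R] for R in rs))
            if new < mu[S]:
                mu[S] = new
                changed = True
    return mu


# Macrostates as frozensets of state names (hypothetical encoding).
P = frozenset("p")
PQ = frozenset("pq")
PQRS = frozenset("pqrs")
refined = outer_analysis({P: 1, PQ: 3, PQRS: 7},
                         {P: [], PQ: [P], PQRS: [PQ, PQRS]})
```

Since estimates only decrease and are bounded from below by zero, the loop terminates; a worklist would avoid re-scanning unchanged macrostates.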


² $K_\mathcal{A}$ can be exponentially larger than $\mathcal{A}$ and the number of cycles in $K_\mathcal{A}$ can be exponential to the size of $K_\mathcal{A}$, so the total number of cycles can be double-exponential.

Example 12. Consider the BA $\mathcal{A}_{ex}$ in Fig. 9a. When started from the initial TRUB $\mu_0 = \{\{p\} \mapsto 1, \{p,q\} \mapsto 3, \{p,q,r,s\} \mapsto 7\}$ (Fig. 9b), outer macrostate analysis decreases the maximum rank estimate for $\{p,q\}$ to 1, since $\min\{\mu_0(\{p,q\}), \max\{\mu_0(\{p\})\}\} = \min\{3,1\} = 1$. The estimate for $\{p,q,r,s\}$ is not affected, because $\min\{7, \max\{1, 7\}\} = 7$ (Fig. 9c). □

Fig. 9: Example of outer macrostate analysis. (a) $\mathcal{A}_{ex}$ (• denotes accepting transitions). The initial TRUB $\mu_0$ in (b) is refined to $\mu^*_{out}$ in (c).


Lemma 13. If $\mu$ is a TRUB, then $\mu \triangleleft \{S \mapsto \mathit{up}_{out}(\mu, S, R_1, \ldots, R_m)\}$ is a TRUB.

Corollary 14. When started with a TRUB $\mu_0$, the outer macrostate analysis terminates and returns a TRUB $\mu^*_{out}$.


5.2 Inner Macrostate Analysis 


Our second analysis, called inner macrostate analysis, looks deeper into super-tight runs in SCHEWE($\mathcal{A}$). In particular, compared with the outer macrostate analysis from the previous section (which only looks at the ranks, i.e., the bounds on the numbers in the rankings), inner macrostate analysis looks at how the rankings assign concrete values to the states of $\mathcal{A}$ inside the macrostates.

Inner macrostate analysis is based on the following. Let $\rho$ be a super-tight run of SCHEWE($\mathcal{A}$) on $\alpha \notin \mathcal{L}(\mathcal{A})$ and $(S, O, f, i)$ be a macrostate from TIGHT. Because $\rho$ is super-tight, we know that the rank $f(q)$ of a state $q \in S$ is bounded by the ranks of the predecessors of $q$. This holds because in super-tight runs, the ranks are only as high as necessary; if the rank of $q$ were higher than the ranks of its predecessors, this would mean that we may wait in WAITING longer and only jump to $q$ with a lower rank later.

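Let us introduce some necessary notation. Let $f, f' \in \mathcal{R}$ be rankings (i.e., $f, f'\colon Q \to \omega$). We use $f \sqcup f'$ to denote the ranking $\{q \mapsto \max\{f(q), f'(q)\} \mid q \in Q\}$, and $f \sqcap f'$ to denote the ranking $\{q \mapsto \min\{f(q), f'(q)\} \mid q \in Q\}$. Moreover, we define $\text{max-succ-rank}^a_S(f) = \max_{\leq}\{f' \in \mathcal{R} \mid f \leadsto^a_S f'\}$ and a function $\mathit{dec}\colon \mathcal{R} \to \mathcal{R}$ such that $\mathit{dec}(\theta)$ is the ranking $\theta'$ for which

$$\theta'(q) = \begin{cases} \theta(q) \mathbin{\dot{-}} 1 & \text{if } \theta(q) = \mathrm{rank}(\theta) \text{ and } q \notin Q_F, \\ \lfloor\!\lfloor \theta(q) \mathbin{\dot{-}} 1 \rfloor\!\rfloor & \text{if } \theta(q) = \mathrm{rank}(\theta) \text{ and } q \in Q_F, \\ \theta(q) & \text{otherwise.} \end{cases} \tag{5}$$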


Intuitively, $\text{max-succ-rank}^a_S(f)$ is the (pointwise) maximum ranking that can be reached from macrostate $S$ with ranking $f$ over $a$ (it is easy to see that there is a unique such maximum ranking) and $\mathit{dec}(\theta)$ decreases the maximum ranks in a ranking $\theta$ by one (or by two for even maximum ranks and accepting states).
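The pointwise operations on rankings and the function dec can be sketched as follows. The sketch assumes rankings are dictionaries from states to natural numbers, that the rank of a ranking is its maximum value, and that the double floor rounds down to the nearest even number; these readings follow the intuition given above, and the function names are illustrative.

```python
def join(f, g):
    """Pointwise maximum of two rankings over the same states."""
    return {q: max(f[q], g[q]) for q in f}


def meet(f, g):
    """Pointwise minimum of two rankings over the same states."""
    return {q: min(f[q], g[q]) for q in f}


def rank_of(theta):
    """Rank of a ranking, taken to be its maximum value."""
    return max(theta.values(), default=0)


def round_down_even(x):
    """Largest even number not exceeding x (assumed meaning of the double floor)."""
    return x - (x % 2)


def dec(theta, accepting):
    """Lower every state holding the maximum rank, as in equation (5)."""
    r = rank_of(theta)
    out = {}
    for q, v in theta.items():
        if v != r:
            out[q] = v                               # not maximal: unchanged
        elif q in accepting:
            out[q] = round_down_even(max(0, v - 1))  # accepting: stay even
        else:
            out[q] = max(0, v - 1)                   # monus by one
    return out
```

For instance, applying `dec` to a ranking in which all four non-accepting states have rank 6 yields rank 5 for all of them.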

The analysis uses the value domain $\mathbb{V} = \mathcal{R}$ (i.e., each macrostate of $K_\mathcal{A}$ is assigned a ranking giving an upper bound on the rank of each state in the macrostate) and the update function $\mathit{up}_{in}$ given in the listing below. Intuitively, $\mathit{up}_{in}$ updates the rank estimate of every $q \in S$ to hold the maximum rank compatible with the ranks of its predecessors. We note Line 6 (the test of whether $\mathrm{rank}(\theta)$ is even), which makes use of the fact that we can only consider tight rankings (whose rank is odd), so we can decrease the estimate using the function $\mathit{dec}$ defined above.

    up_in($\mu$, $S$, $R_1, \ldots, R_m$):
        foreach $1 \leq i \leq m$ and $a \in \Sigma$ do
            if $\delta(R_i, a) = S$ then
                $g_i^a \leftarrow \text{max-succ-rank}^a_{R_i}(\mu(R_i))$
        $\theta \leftarrow \mu(S) \sqcap \bigsqcup \{g_i^a \mid g_i^a \text{ is defined}\}$;
        if $\mathrm{rank}(\theta)$ is even then $\theta \leftarrow \mathit{dec}(\theta)$;
        return $\theta$;


Example 15. Let us continue the example from Section 5.1 and perform inner macrostate analysis starting with the TRUB $\{\{p{:}1\}, \{p{:}1, q{:}1\}, \{p{:}7, q{:}7, r{:}7, s{:}7\}\}$ obtained from $\mu^*_{out}$. We show three iterations of the algorithm for $\{p,q,r,s\}$ in the accompanying figure (we do not show $\{p,q\}$ except the first iteration since it does not affect intermediate steps). We can notice that in the three iterations, we could decrease the maximum rank estimate to $\{p{:}6, q{:}6, r{:}6, s{:}6\}$ due to the accepting transitions from $r$ and $s$. In the last of the three iterations, when all states have the even rank 6, the condition on Line 6 (the rank being even) would become true and the rank of all states would be decremented to 5 using $\mathit{dec}$. Then, again, the accepting transitions from $r$ and $s$ would decrease the rank of $p$ to 4, which would be propagated to $q$ and so on. Eventually, we would arrive at the TRUB $\{p{:}1, q{:}1, r{:}1, s{:}1\}$, which could not be decreased any more, since $\{p{:}1, q{:}1\}$ forces the ranks of $r$ and $s$ to stay at 1. □

[Figure: iterations of inner macrostate analysis for $\{p,q,r,s\}$, with predecessor $\{p{:}1, q{:}1\}$: $\{p{:}7, q{:}7, r{:}7, s{:}7\}$, $\{p{:}6, q{:}7, r{:}7, s{:}7\}$, $\{p{:}6, q{:}6, r{:}7, s{:}7\}$, $\{p{:}6, q{:}6, r{:}6, s{:}6\}$, and, after $\mathit{dec}$, $\{p{:}5, q{:}5, r{:}5, s{:}5\}$.]


Lemma 16. If $\mu$ is a TRUB, then $\mu \triangleleft \{S \mapsto \mathit{up}_{in}(\mu, S, R_1, \ldots, R_m)\}$ is a TRUB.

Corollary 17. When started with a TRUB $\mu_0$, the inner macrostate analysis terminates and returns a TRUB $\mu^*_{in}$.


6 Experimental Evaluation 


Used tools and evaluation environment. We implemented the techniques described in the previous sections as an extension of the tool RANKER [18] (written in C++). Speaking in the terms of [19], the heuristics were implemented on top of the RANKER$_{\text{MAXR}}$ configuration (we refer to this previous version as RANKER$_{\text{OLD}}$). We tested the correctness of our implementation using Spot's autcross on all BAs in our benchmark. We compared modified RANKER with other state-of-the-art tools, namely, GOAL [41] (implementing PITERMAN [34], SCHEWE [37], SAFRA [36], and FRIBOURG [1]), SPOT 2.9.3 [12] (implementing Redziejowski's algorithm [35]), SEMINATOR 2 [4], LTL2DSTAR 0.5.4 [23], and ROLL [26]. All tools were set to the mode where they output an automaton with the standard state-based Büchi acceptance condition. The experimental evaluation was performed on a 64-bit GNU/Linux Debian workstation with an Intel(R) Xeon(R) CPU E5-2620 running at 2.40 GHz with 32 GiB of RAM and using a timeout of 5 minutes.


Datasets. As the source of our benchmark, we use the two following datasets: (i) random containing 11,000 BAs over a two-letter alphabet used in [40], which were randomly generated via the Tabakov-Vardi approach [39], starting from 15 states and with various different parameters; (ii) LTL with 1,721 BAs over larger alphabets (up to 128 symbols) used in [4], which were obtained from LTL formulae from literature (221) or randomly generated (1,500). We preprocessed the automata using RABIT [30] and Spot's autfilt (using the --high simplification level), transformed them to state-based acceptance BAs (if they were not already), and converted to the HOA format [2]. From this set, we removed automata that were (i) semi-deterministic, (ii) inherently weak, (iii) unambiguous, or (iv) have an empty language, since for these automata types there exist more efficient complementation procedures than for unrestricted BAs [5,4,6,28]. In the end, we were left with 2,592 (random) and 414 (LTL) hard automata. We use all to denote their union (3,006 BAs). Of these hard automata, 458 were elevator automata.

Fig. 10: Comparison of the state space generated by our optimizations and other rank-based procedures (horizontal and vertical dashed lines represent timeouts). Blue data points are from random and red data points are from LTL. Axes are logarithmic.


6.1 Generated State Space 


In our first experiment, we evaluated the effectiveness of our heuristics for pruning the 
generated state space by comparing the sizes of complemented BAs without postprocess- 
ing. This use case is directed towards applications where postprocessing is irrelevant, 
such as inclusion or equivalence checking of BAs. 

We focused on a comparison with two less optimized versions of the rank-based complementation procedure: SCHEWE (the version "Reduced Average Outdegree" from [37] implemented in GOAL under -m rank -tr -ro) and its optimization RANKER$_{\text{OLD}}$. The scatter plots in Fig. 10 compare the numbers of states of automata generated by RANKER and the other algorithms and the upper part of Table 1 gives summary statistics. Observe that our optimizations from this paper drastically reduced the generated search space compared with both SCHEWE and RANKER$_{\text{OLD}}$ (the mean for SCHEWE is lower than for RANKER$_{\text{OLD}}$ due to its much higher number of timeouts); from Fig. 10b we can see that the improvement was in many cases exponential even when compared with our previous optimizations in RANKER$_{\text{OLD}}$. The median (which is a more meaningful indicator in the presence of timeouts) decreased by 44 % w.r.t. RANKER$_{\text{OLD}}$, and we also reduced the

132 Vojtěch Havlena, Ondřej Lengál, Barbora Smahlíková 


Table 1: Statistics for our experiments. The upper part compares various optimizations of the rank-based procedure (no postprocessing). The lower part compares RANKER to other approaches (with postprocessing). The left-hand side (mean, median, wins, losses) compares sizes of complement BAs and the right-hand side runtimes of the tools. The wins and losses columns give the number of times when RANKER was strictly better and worse. The values are given for the three datasets as "all (random : LTL)". Approaches in GOAL are labelled with (G).

method          mean              median         wins               losses          mean runtime [s]     median runtime [s]   timeouts
RANKER_OLD      7398 (8688:358)   141 (197:29)   2190 (2011:179)    111 (107:4)     9.37 (10.73:1.99)    0.61 (1.04:0.04)     365 (360:5)
SCHEWE (G)      4550 (5495:665)   439 (774:35)   2640 (2315:325)    55 (1:54)       21.05 (24.28:7.80)   6.57 (7.39:5.21)     937 (928:9)
--------------------------------------------------------------------------------------------------------------------------------------------
PITERMAN (G)    73 (82:22)        28 (34:14)     1435 (1124:311)    416 (360:56)    7.29 (7.39:6.65)     5.99 (6.04:5.62)     14 (12:2)
SAFRA (G)       83 (91:30)        29 (35:17)     1562 (1211:351)    387 (350:37)    14.11 (15.05:8.37)   6.71 (6.92:5.79)     172 (158:14)
SPOT            75 (85:15)        24 (32:10)     1087 (936:151)     683 (501:182)   0.86 (0.99:0.06)     0.02 (0.02:0.02)     13 (13:0)
FRIBOURG (G)    91 (104:13)       23 (31:9)      1120 (1055:65)     601 (376:225)   17.79 (19.53:7.22)   9.25 (10.15:5.48)    81 (80:1)
LTL2DSTAR       73 (82:21)        28 (34:13)     1465 (1195:270)    465 (383:82)    3.31 (3.84:0.11)     0.04 (0.05:0.02)     136 (130:6)
SEMINATOR 2     79 (91:15)        21 (29:10)     1266 (1131:135)    571 (367:204)   9.51 (11.25:0.08)    0.22 (0.39:0.02)     363 (362:1)
ROLL            18 (19:14)        10 (9:11)      2116 (1858:258)    569 (443:126)   31.23 (37.85:7.28)   8.19 (12.23:2.74)    1109 (1106:3)


number of timeouts by 23 %. Notice that the numbers for the LTL dataset do not differ 
as much as for random, witnessing the easier structure of the BAs in LTL. 


6.2 Comparison with Other Complementation Techniques 

In our second experiment, we compared the improved RANKER with other state-of-the-art tools. Since we were comparing sizes of output BAs, we postprocessed each output automaton with autfilt (simplification level --high). Scatter plots are given in Fig. 11, where we compare RANKER with SPOT (which had the best results on average from the other tools except ROLL) and ROLL, and summary statistics are in the lower part of Table 1. Observe that RANKER has by far the lowest mean (except ROLL) and the third lowest median (after SEMINATOR 2 and ROLL, but with fewer timeouts). Moreover, comparing the numbers in columns wins and losses we can see that RANKER gives strictly better results than other tools (wins) more often than the other way round (losses).

In Fig. 11a, we can see that indeed in the majority of cases RANKER gives a smaller BA than SPOT, especially for harder BAs (SPOT, however, behaves slightly better on the simpler BAs from LTL). The results in Fig. 11b are less clear-cut. ROLL uses a learning-based approach (more heavyweight and completely orthogonal to any of the other tools) and can in some cases output a tiny automaton, but does not scale, as witnessed by its number of timeouts, which is much higher than that of any other tool. It is, therefore, positively surprising that RANKER could in most of the cases still obtain a much smaller automaton than ROLL.

Regarding runtimes, the prototype implementation in RANKER is comparable to SEMINATOR 2, but slower than SPOT and LTL2DSTAR (SPOT is the fastest tool). Implementations of other approaches clearly do not target speed. We note that the number of timeouts of RANKER is still higher than that of some other tools (in particular PITERMAN, SPOT, FRIBOURG); further state space reduction targeting this particular issue is our future work.


7 Related Work 


BA complementation has remained of interest to researchers since the introduction of BAs by Büchi in [8]. Alongside the hunt for efficient complementation techniques, effort has been put into establishing the lower bound. First, Michel showed that the lower bound is $n!$ (approx. $(0.36n)^n$) [31] and later Yan refined the result to $(0.76n)^n$ [43].

Fig. 11: Comparison of the complement size obtained by RANKER and other state-of-the-art tools (horizontal and vertical dashed lines represent timeouts). Axes are logarithmic.


The complementation approaches can be roughly divided into several branches. 
Ramsey-based complementation, the very first complementation construction, where 
the language of an input automaton is decomposed into a finite number of equivalence 
classes, was proposed by Büchi and was further enhanced in [7]. Determinization-based
complementation was presented by Safra in [36] and later improved by Piterman
in [34] and Redziejowski in [35]. Various optimizations for determinization of BAs were 
further proposed in [29]. The main idea of this approach is to convert an input BA into an 
equivalent deterministic automaton with different acceptance condition that can be easily 
complemented (e.g. Rabin automaton). The complemented automaton is then converted 
back into a BA (often for the price of some blow-up). Slice-based complementation tracks 
the acceptance condition using a reduced abstraction on a run tree [42,21]. A learning- 
based approach was introduced in [27,26]. Allred and Ultes-Nitsche then presented 
a novel optimal complementation algorithm in [1]. For some special types of BAs, e.g., 
deterministic [25], semi-deterministic [5], or unambiguous [28], there exist specific 
complementation algorithms. Semi-determinization based complementation converts 
an input BA into a semi-deterministic BA [11], which is then complemented [4]. 

Rank-based complementation, studied in [24,15,14,37,22], extends the subset con- 
struction for determinization of finite automata by storing additional information in 
each macrostate to track the acceptance condition of all runs of the input automaton. 
Optimizations of an alternative (sub-optimal) rank-based construction from [24] going through alternating Büchi automata were presented in [15]. Furthermore, the work in [22] introduces an optimization of SCHEWE, in some cases producing smaller automata (this construction is not compatible with our optimizations). As shown in [9], the rank-based construction can be optimized using simulation relations. We identified several heuristics that help reduce the size of the complement in [19], which are compatible with the heuristics in this paper.


Acknowledgements. We thank anonymous reviewers for their useful remarks that helped 
Foundation project 20-07487S and the FIT BUT internal project FIT-S-20-6427.
Abstract. Parity games can be used to represent many different kinds
of decision problems. In practice, tools that use parity games often rely 
on a specification in a higher-order logic from which the actual game 
can be obtained by means of an exploration. For many of these decision 
problems we are only interested in the solution for a designated vertex in 
the game. We formalise how to use on-the-fly solving techniques during 
the exploration process, and show that this can help to decide the winner 
of such a designated vertex in an incomplete game. Furthermore, we 
define partial solving techniques for incomplete parity games and show 
how these can be made resilient to work directly on the incomplete game, 
rather than on a set of safe vertices. We implement our techniques for 
symbolic parity games and study their effectiveness in practice, showing 
that speed-ups of several orders of magnitude are feasible and overhead 
(if unavoidable) is typically low. 


1 Introduction 


A parity game is a two-player game with an $\omega$-regular winning condition, played
by players $\Diamond$ ('even') and $\Box$ ('odd') on a directed graph. The true complexity of
solving parity games is still a major open problem, with the most recent breakthroughs
yielding algorithms running in quasi-polynomial time, see, e.g., [18,7].
Apart from their intriguing status, parity games pop up in various fundamental
results in computer science (e.g., in the proof of decidability of a monadic second-order
theory). In practice, parity games provide an elegant, uniform framework
to encode many relevant decision problems, which include model checking problems,
synthesis problems and behavioural equivalence checking problems.

Often, a decision problem that is encoded as a parity game can be answered
by determining which of the two players wins a designated vertex in the game
graph. Depending on the characteristics of the game, it may be the case that
only a fraction of the game is relevant for deciding which player wins a vertex.
For instance, deciding whether a transition system satisfies an invariant can be
encoded by a simple, solitaire (i.e., single player) parity game. In such a game,
player $\Box$ wins all vertices that are sinks (i.e., have no successors), and all states
leading to such sinks, so checking whether sinks are reachable from a designated
vertex suffices to determine whether this vertex is won by $\Box$, too. Clearly, as soon
as a sink is detected, any further inspection of the game becomes irrelevant.
A complicating factor is that in practice, the parity games that encode decision
problems are not given explicitly. Rather, they are specified in some higher-order
logic such as a parameterised Boolean equation system, see, e.g., [11]. Exploring
the parity game from such a higher-order specification is, in general,
time- and memory-consuming. To counter this, symbolic exploration techniques
have been proposed, see, e.g., [19]. These explore the game graph on-the-fly and
exploit efficient symbolic data structures such as LDDs [13] to represent sets of
vertices and edges. Many parity game solving algorithms can be implemented
quite effectively using such data structures [20,28,29], so that in the end, exploring
the game graph often remains the bottleneck.


In this paper, we study how to combine the exploration of a parity game
and the on-the-fly solving of the explored part, with the aim to speed up the
overall solving process. The central problem when performing on-the-fly solving
during the exploration phase is that we have to deal with incomplete information 
when determining the winner for a designated vertex. Moreover, in the symbolic 
setting, the exploration order may be unpredictable when advanced strategies 
such as chaining and saturation [9] are used. 


To formally reason about all possible exploration strategies and the artefacts
they generate, we introduce the concept of an incomplete parity game, and an
ordering on these. Incomplete parity games are parity games where for some
vertices not all outgoing edges are necessarily known. In practice, these could be
identified by, e.g., the todo queue in a classical breadth-first search. The extra
information captured by an incomplete parity game allows us to characterise
the safe set for a given player $\alpha$. This is a set of vertices for which it can be
established that if player $\alpha$ wins the vertex, then she cannot lose the vertex if
more information becomes available. We prove an optimality result for safe sets,
which, informally, states that a safe set for player $\alpha$ is also the largest set with
this property (see Theorem 1).


The vertices won by player $\alpha$ in an $\alpha$-safe set can be determined using a
standard parity game solving algorithm such as, e.g., Zielonka's recursive
algorithm [31] or Priority Promotion [2]. However, these algorithms may be less
efficient as on-the-fly solvers. For this reason, we study three symbolic partial
solvers: solitaire winning cycle detection, forced winning cycle detection and fatal
attractors [17]. In particular cases, first determining the safe set for a player
and only subsequently solving the game using one of these partial solvers will
incur an additional overhead. As a final result, we therefore prove that all these
solvers can be (modified to) run on the incomplete game as a whole, rather than
on the safe set of a player (see Propositions 1-3).


As a proof of concept, we have implemented an (open source) symbolic tool
for the mCRL2 toolset [6], that explores a parity game specified by a parameterised
Boolean equation system and solves these games on-the-fly. We report
on the effectiveness of our implementation on typical parity games stemming
from, e.g., model checking and equivalence checking problems, showing that it
can speed up the process by several orders of magnitude, while adding low
overhead if the entire game is needed for solving.
Related Work. Our work is related to existing techniques for solving symbolic
parity games such as [20,19], as we extend these existing methods with on-the-fly
solving. Naturally, our work is also related to existing work for on-the-fly
model checking. This includes work for on-the-fly (explicit) model checking of
regular alternation-free modal mu-calculus formulas [23] and work for on-the-fly
symbolic model checking of RCTL [1]. Compared to these, our method is
more general as it can be applied to the full modal mu-calculus (with data),
which subsumes RCTL and the alternation-free subset. Optimisations such as
the observation that checking LTL formulas of type AG reduces to reachability
checks [14] are a special case of our methods and partial solvers. Furthermore, our
methods are not restricted to model checking problems only and can be applied
to any parity game, including decision problems such as equivalence checking [8].
Finally, our method is agnostic to the exploration strategy employed.


Structure of the paper. In Section 2 we recall parity games. In Section 3 we 
introduce incomplete parity games and show how partial solving can be applied 
correctly. In Section 4 we present several partial solvers that we employ for 
on-the-fly solving. Finally, in Section 5 we discuss the implementation of these 
techniques and apply them to several practical examples. The omitted proofs for 
the supporting lemmas can be found in [22]. 


2 Preliminaries 


A parity game is an infinite-duration, two-player game that is played on a finite
directed graph. The objective of the two players, called even (denoted by $\Diamond$) and
odd (denoted by $\Box$), is to win vertices in the graph.


Definition 1. A parity game is a directed graph $G = (V, E, p, (V_\Diamond, V_\Box))$, where

- $V$ is a finite set of vertices, partitioned in sets $V_\Diamond$ and $V_\Box$ of vertices owned
  by $\Diamond$ and $\Box$, respectively;

- $E \subseteq V \times V$ is the edge relation;

- $p : V \to \mathbb{N}$ is a function that assigns a priority to each vertex.


Henceforth, let $G = (V, E, p, (V_\Diamond, V_\Box))$ be an arbitrary parity game. Throughout
this paper, we use $\alpha$ to denote an arbitrary player and $\bar\alpha$ denotes the opponent.
We write $vE$ to denote the set of successors $\{w \in V \mid (v, w) \in E\}$ of vertex
$v$. The set $\mathsf{sinks}(G)$ is defined as the largest set $U \subseteq V$ satisfying for all $v \in U$
that $vE = \emptyset$; i.e., $\mathsf{sinks}(G)$ is the set of all sinks: vertices without successors.
If we are only concerned with the sinks of player $\alpha$, we write $\mathsf{sinks}_\alpha(G)$; i.e.,
$\mathsf{sinks}_\alpha(G) = V_\alpha \cap \mathsf{sinks}(G)$. We write $G \cap U$, for $U \subseteq V$, to denote the subgame
$(U, (U \times U) \cap E, p|_U, (V_\Diamond \cap U, V_\Box \cap U))$, where $p|_U(v) = p(v)$ for all vertices
$v \in U$.
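The notation above can be mirrored by a small explicit data structure. The following Python sketch is only an illustration under our own assumptions: the class name `ParityGame`, the constants `EVEN` and `ODD`, and the three-vertex example game are hypothetical and do not come from the paper or from the mCRL2 toolset (which works symbolically).

```python
from dataclasses import dataclass

EVEN, ODD = 0, 1  # hypothetical encodings of the two players


@dataclass(frozen=True)
class ParityGame:
    owner: dict       # vertex -> EVEN or ODD (the partition of V)
    priority: dict    # vertex -> natural number (the function p)
    edges: frozenset  # set of pairs (v, w) (the relation E)

    @property
    def vertices(self):
        return set(self.owner)

    def successors(self, v):
        """The successor set vE of vertex v."""
        return {w for (u, w) in self.edges if u == v}

    def sinks(self, player=None):
        """sinks(G), or the sinks owned by `player` when one is given."""
        return {v for v in self.vertices
                if not self.successors(v)
                and (player is None or self.owner[v] == player)}

    def restrict(self, U):
        """The subgame of G restricted to the vertex set U."""
        U = set(U)
        return ParityGame(
            owner={v: o for v, o in self.owner.items() if v in U},
            priority={v: p for v, p in self.priority.items() if v in U},
            edges=frozenset((v, w) for (v, w) in self.edges
                            if v in U and w in U),
        )


# A made-up three-vertex game (not the game of Figure 1).
G = ParityGame(
    owner={"a": EVEN, "b": ODD, "c": EVEN},
    priority={"a": 2, "b": 1, "c": 0},
    edges=frozenset({("a", "b"), ("b", "a"), ("b", "c")}),
)
```

In this example `G.sinks()` contains only vertex `c`, and restricting the game to `{"a", "b"}` removes the edge into `c`, leaving a subgame without sinks.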


Example 1. Consider the graph depicted in Figure 1, representing a parity game.
Diamond-shaped vertices are owned by player $\Diamond$, whereas box-shaped vertices
are owned by player $\Box$. The priority of a vertex is written inside the vertex.
Vertex $u_1$ is a sink owned by player $\Box$.

Fig. 1. An example parity game


Plays and strategies. The game is played as follows. Initially, a token is placed on
a vertex of the graph. The owner of a vertex on which the token resides gets to
decide the successor vertex (if any) that the token is moved to next. A maximal
sequence of vertices (i.e., an infinite sequence or a finite sequence ending in a
sink) visited by the token by following this simple rule is called a play. A finite
play $\pi$ is won by player $\Diamond$ if the sink in which it ends is owned by player $\Box$, and
it is won by player $\Box$ if the sink is owned by player $\Diamond$. An infinite play $\pi$ is won
by player $\Diamond$ if the minimal priority that occurs infinitely often along $\pi$ is even,
and it is won by player $\Box$ otherwise.
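As a small illustration of the two winning conditions, the sketch below decides the winner of a finite play and of an ultimately periodic ("lasso") infinite play. The function names and the example data are our own assumptions; general infinite plays need not be lassos, but plays induced by memoryless strategies on a finite graph are.

```python
EVEN, ODD = 0, 1  # hypothetical encodings of the players 'even' and 'odd'


def winner_of_finite_play(play, owner):
    """A finite play ends in a sink; the opponent of the sink's owner wins."""
    return 1 - owner[play[-1]]


def winner_of_lasso_play(prefix, cycle, priority):
    """Winner of the infinite play prefix . cycle^omega.

    Only the priorities on the cycle occur infinitely often, so the
    parity of their minimum decides the winner; the prefix is irrelevant.
    """
    return EVEN if min(priority[v] for v in cycle) % 2 == 0 else ODD


# Made-up example data.
owner = {"a": EVEN, "b": ODD, "s": ODD}
priority = {"a": 1, "b": 2, "s": 0}

finite_winner = winner_of_finite_play(["a", "s"], owner)       # sink s is owned by ODD
lasso_winner = winner_of_lasso_play(["a"], ["b"], priority)    # only priority 2 repeats
mixed_winner = winner_of_lasso_play([], ["a", "b"], priority)  # minimum 1 repeats
```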

A strategy $\sigma_\alpha : V^{*} V_\alpha \to V$ for player $\alpha$ is a partial function that prescribes
where player $\alpha$ moves the token next, given a sequence of vertices visited by the
token. A play $v_0 v_1 \ldots$ is consistent with a strategy $\sigma$ if and only if $\sigma(v_0 \ldots v_i) =
v_{i+1}$ for all $i$ for which $\sigma(v_0 \ldots v_i)$ is defined. Strategy $\sigma_\alpha$ is winning for player
$\alpha$ in vertex $v$ if all plays consistent with $\sigma_\alpha$ and starting in $v$ are won by $\alpha$.
Player $\alpha$ wins vertex $v$ if and only if she has a winning strategy $\sigma_\alpha$ for vertex $v$.
The parity game solving problem asks to compute the set of vertices $W_\Diamond$, won
by player $\Diamond$, and the set $W_\Box$, won by player $\Box$. Note that since parity games are
determined [31,24], every vertex is won by one of the two players. That is, the
sets $W_\Diamond$ and $W_\Box$ partition the set $V$.


Example 2. Consider the parity game depicted in Figure 1. In this game, the
strategy $\sigma_\Diamond$, partially defined as $\sigma_\Diamond(\pi u_0) = u_2$ and $\sigma_\Diamond(\pi u_2) = u_0$, for arbitrary
$\pi$, is winning for player $\Diamond$ in $u_0$ and $u_2$. Player $\Box$ wins vertex $u_3$ using strategy
$\sigma_\Box(\pi u_3) = u_4$, for arbitrary $\pi$. Note that player $\Diamond$ is always forced to move the
token from $u_4$ to $u_3$. Vertex $u_1$ is a sink, owned by player $\Box$, and hence, won by
player $\Diamond$.


Dominions. A strategy $\sigma_\alpha$ is said to be closed on a set of vertices $U \subseteq V$ iff
every play, consistent with $\sigma_\alpha$ and starting in a vertex $v \in U$, remains in $U$. If
player $\alpha$ has a strategy that is closed on $U$, we say that the set $U$ is $\alpha$-closed.
A dominion for player $\alpha$ is a set of vertices $U \subseteq V$ such that player $\alpha$ has a
strategy $\sigma_\alpha$ that is closed on $U$ and which is winning for $\alpha$. Note that the sets
$W_\Diamond$ and $W_\Box$ are dominions for player $\Diamond$ and player $\Box$, respectively, and, hence,
every vertex won by player $\alpha$ must belong to an $\alpha$-dominion.


Example 3. Reconsider the parity game of Figure 1. Observe that player $\Box$ has
a closed strategy on $\{u_3, u_4\}$, which is also winning for player $\Box$. Hence, the
set $\{u_3, u_4\}$ is a $\Box$-dominion. Furthermore, the set $\{u_2, u_3, u_4\}$ is $\Diamond$-closed.
However, none of the strategies for which $\{u_2, u_3, u_4\}$ is closed for player $\Diamond$ is
winning for her; therefore $\{u_2, u_3, u_4\}$ is not a $\Diamond$-dominion.
Predecessors, control predecessors and attractors. Let $U \subseteq V$ be a set of vertices.
We write $\mathsf{pre}(G, U)$ to denote the set of predecessors $\{v \in V \mid \exists u \in U : u \in
vE\}$ of $U$ in $G$. The control predecessor set of $U$ for player $\alpha$ in $G$, denoted
$\mathsf{cpre}_\alpha(G, U)$, contains those vertices for which $\alpha$ is able to force entering $U$ in
one step. It is defined as follows:


$\mathsf{cpre}_\alpha(G, U) = (V_\alpha \cap \mathsf{pre}(G, U)) \cup (V_{\bar\alpha} \setminus (\mathsf{pre}(G, V \setminus U) \cup \mathsf{sinks}(G)))$


Note that both $\mathsf{pre}$ and $\mathsf{cpre}$ are monotone operators on the complete lattice
$(2^V, \subseteq)$. The $\alpha$-attractor to $U$ in $G$, denoted $\mathsf{Attr}_\alpha(G, U)$, is the set of vertices
from which player $\alpha$ can force play to reach a vertex in $U$:


$\mathsf{Attr}_\alpha(G, U) = \mu Z.(U \cup \mathsf{cpre}_\alpha(G, Z))$


The $\alpha$-attractor to $U$ can be computed by means of a fixed point iteration,
starting at $U$ and adding $\alpha$-control predecessors in each iteration until a stable
set is reached. We note that the $\alpha$-attractor to an $\alpha$-dominion $D$ is again an
$\alpha$-dominion.
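The fixed point iteration described above can be sketched on explicit sets as follows. This is a minimal illustration, assuming a game given as a successor map `succ` and an ownership map `owner` (both hypothetical names); the six-vertex game is made up for the example and is not the game of Figure 1.

```python
EVEN, ODD = 0, 1  # hypothetical encodings of the two players


def pre(succ, U):
    """Vertices with at least one successor in U."""
    return {v for v, ws in succ.items() if ws & U}


def cpre(succ, owner, player, U):
    """Control predecessors: `player` can force the token into U in one step."""
    V = set(succ)
    sinks = {v for v in V if not succ[v]}
    own = {v for v in V if owner[v] == player}
    opp = V - own
    # Own vertices need one successor in U; opponent vertices must be
    # non-sinks whose successors all lie in U.
    return (own & pre(succ, U)) | (opp - (pre(succ, V - U) | sinks))


def attr(succ, owner, player, U):
    """Least fixed point: start at U, add control predecessors until stable."""
    Z = set(U)
    while True:
        nxt = Z | cpre(succ, owner, player, Z)
        if nxt == Z:
            return Z
        Z = nxt


# Made-up example: ODD vertex c can escape to d, ODD sink s is never attracted.
succ = {"t": {"t"}, "a": {"t", "b"}, "b": {"t"},
        "c": {"t", "d"}, "d": {"d"}, "s": set()}
owner = {"t": EVEN, "a": EVEN, "b": ODD, "c": ODD, "d": EVEN, "s": ODD}

one_step = cpre(succ, owner, EVEN, {"t"})
attractor = attr(succ, owner, EVEN, {"t"})
```

Here vertex `b` is attracted because its only successor is `t`, whereas `c` is not, since player ODD can move to `d` instead.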


Example 4. Consider the parity game $G$ of Figure 1 once again. The $\Diamond$-control
predecessors of $\{u_2\}$ form the set $\{u_0\}$. Note that since player $\Box$ can avoid moving
to $u_2$ from vertex $u_3$ by moving to vertex $u_4$, vertex $u_3$ is not among the $\Diamond$-control
predecessors of $\{u_2\}$. The $\Diamond$-attractor to $\{u_2\}$ is the set $\{u_0, u_2\}$, which
is the largest set of vertices for which player $\Diamond$ has a strategy to force play to
the set of vertices $\{u_2\}$.


3 Incomplete Parity Games 


In many practical applications that rely on parity game solving, the parity game 
is gradually constructed by means of an exploration, often starting from an ‘ini- 
tial’ vertex. This is, for instance, the case when using parity games in the context 
of model checking or when deciding behavioural preorders or equivalences. For 
such applications, it may be profitable to combine exploration and solving, so 
that the costly exploration can be terminated when the winner of a particular 
vertex of interest (often the initial vertex) has been determined. The example 
below, however, illustrates that one cannot naively solve the parity game con- 
structed so far. 


Example 5. Consider the parity game $G$ in Figure 2, consisting of all vertices
and only the solid edges. This game could, for example, be the result of an
exploration starting from $u_4$. Then $G \cap \{u_0, u_1, u_2, u_3, u_4, u_5\}$ is a subgame for
which we can conclude that all vertices form a $\Diamond$-dominion. However, after
exploring the dotted edges, player $\Box$ can escape to vertex $u_4$ from vertex $u_5$.
Consequently, vertices $u_4$ and $u_5$ are no longer won by player $\Diamond$ in the extended
game. Furthermore, observe that the additional edge from $u_3$ to $u_5$ does not
affect the previously established fact that player $\Diamond$ wins this vertex.


Fig. 2. A parity game where the dotted edges are not yet known. 


To facilitate reasoning about games with incomplete information, we first introduce
the notion of an incomplete parity game.


Definition 2. An incomplete parity game is a structure $\Game = (G, I)$, where $G$ is
a parity game $(V, E, p, (V_\Diamond, V_\Box))$, and $I \subseteq V$ is a set of vertices with potentially
unexplored successors. We refer to the set $I$ as the set of incomplete vertices;
the set $V \setminus I$ is the set of complete vertices.


Observe that $(G, \emptyset)$ is a 'standard' parity game. We permit ourselves to use
the notation for parity game notions such as plays, strategies, dominions, etcetera
also in the context of incomplete parity games. In particular, for $\Game = (G, I)$,
we will write $\mathsf{pre}(\Game, U)$ and $\mathsf{Attr}_\alpha(\Game, U)$ to indicate $\mathsf{pre}(G, U)$ and $\mathsf{Attr}_\alpha(G, U)$,
respectively. Furthermore, we define $\Game \cap U$ as the structure $(G \cap U, I \cap U)$.

Intuitively, while exploring a parity game, we extend the set of vertices and 
edges by exploring the incomplete vertices. Doing so gives rise to potentially 
new incomplete vertices. At each stage in the exploration, the incomplete parity 
game extends incomplete parity games explored in earlier stages. We formalise 
the relation between incomplete parity games, abstracting from any particular 
order in which vertices and edges are explored. 


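Definition 3. Let $\Game = ((V, E, p, (V_\Diamond, V_\Box)), I)$ and $\Game' = ((V', E', p', (V'_\Diamond, V'_\Box)), I')$ be
incomplete parity games. We write $\Game \sqsubseteq \Game'$ iff the following conditions hold:

(1) $V \subseteq V'$, $V_\Diamond \subseteq V'_\Diamond$ and $V_\Box \subseteq V'_\Box$;
(2) $E \subseteq E'$ and $((V \setminus I) \times V') \cap E' \subseteq E$;
(3) $p = p'|_V$;
(4) $I' \cap V \subseteq I$.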


Conditions (1) and (3) are self-explanatory. Condition (2) states that on the
one hand, no edges are lost, and, on the other hand, $E'$ can only add edges
from vertices that are incomplete: for complete vertices, $E'$ specifies no new
successors. Finally, condition (4) captures that the set of incomplete vertices $I'$
cannot contain vertices that were previously complete. We note that the ordering
$\sqsubseteq$ is reflexive, anti-symmetric and transitive.
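The four conditions of the ordering can be checked mechanically on explicit games. The sketch below is an illustration only: the dictionary layout (`V`, `owner`, `prio`, `E`, `I`), the function name `is_extended_by` and the example games are our own assumptions. Condition (2) is read here as "edges of the larger game that leave a complete vertex were already present", which matches the explanation above.

```python
def is_extended_by(small, large):
    """Check the four conditions for `large` being an extension of `small`."""
    V, I = small["V"], small["I"]
    if not V <= large["V"]:
        return False
    # (1) ownership of the old vertices is unchanged
    same_owner = all(small["owner"][v] == large["owner"][v] for v in V)
    # (2) no edge is lost, and complete vertices gain no new successors
    complete = V - I
    no_lost_edges = small["E"] <= large["E"]
    no_new_successors = all((v, w) in small["E"]
                            for (v, w) in large["E"] if v in complete)
    # (3) priorities of the old vertices are unchanged
    same_priority = all(small["prio"][v] == large["prio"][v] for v in V)
    # (4) a complete vertex never becomes incomplete again
    no_reopening = (large["I"] & V) <= I
    return (same_owner and no_lost_edges and no_new_successors
            and same_priority and no_reopening)


# Made-up example: exploring the incomplete vertex b discovers vertex c.
small = dict(V={"a", "b"}, owner={"a": 0, "b": 1}, prio={"a": 2, "b": 1},
             E={("a", "b")}, I={"b"})
large = dict(V={"a", "b", "c"}, owner={"a": 0, "b": 1, "c": 0},
             prio={"a": 2, "b": 1, "c": 0},
             E={("a", "b"), ("b", "c")}, I={"c"})

# Violations: a new edge from the complete vertex a, and re-opening a.
bad_edge = dict(large, E=large["E"] | {("a", "c")})
bad_reopen = dict(large, I={"a", "c"})
```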


Example 6. Suppose that $\Game = (G, I)$ is the incomplete parity game depicted in
Figure 2, where $G$ is the game with all vertices and only the solid edges, and
$I = \{u_3, u_5\}$. Then $\Game \sqsubseteq \Game'$, where $\Game' = (G', I')$ is the incomplete parity game
where $G'$ is the depicted game with all vertices and both the solid edges and
dotted edges, and $I' = \emptyset$.


On-The-Fly Solving for Symbolic Parity Games 143 


Let us briefly return to Example 5. We concluded that the winner of vertex
$u_4$ (and also $u_5$) changed when adding new information. The reason is that
player $\Box$ has a strategy to reach an incomplete vertex owned by her. Such an
incomplete vertex may present an opportunity to escape from plays that would
be non-winning otherwise. On the other hand, the incomplete vertex $u_3$ has
already been sufficiently explored to allow for concluding that this vertex is
won by player $\Diamond$, even if more successors are added to $u_3$. This suggests that
for some subset of vertices, we can decide their winner in an incomplete parity
game and preserve that winner in all future extensions of the game. We formally
characterise this set of vertices in the definition below.


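Definition 4. Let $\Game = (G, I)$, with $G = (V, E, p, (V_\Diamond, V_\Box))$, be an incomplete
parity game. The $\alpha$-safe vertices for $\Game$, denoted by $\mathsf{safe}_\alpha(\Game)$, is the set $V \setminus
\mathsf{Attr}_{\bar\alpha}(G, V_{\bar\alpha} \cap I)$.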


Example 7. Consider the incomplete parity game $\Game$ of Example 6 once more. We
have $\mathsf{safe}_\Diamond(\Game) = \{u_0, u_1, u_2, u_3\}$ and $\mathsf{safe}_\Box(\Game) = \{u_0, u_1, u_2, u_4, u_5\}$.
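To make Definition 4 concrete, the following sketch computes the safe set on an explicit game by removing the opponent's attractor to her own incomplete vertices. All names (`safe_set`, `succ`, `owner`, `incomplete`) and the three-vertex game are assumptions made for illustration; the game is not the one of Figure 2.

```python
EVEN, ODD = 0, 1  # hypothetical encodings of the two players


def pre(succ, U):
    return {v for v, ws in succ.items() if ws & U}


def cpre(succ, owner, player, U):
    V = set(succ)
    sinks = {v for v in V if not succ[v]}
    own = {v for v in V if owner[v] == player}
    return (own & pre(succ, U)) | ((V - own) - (pre(succ, V - U) | sinks))


def attr(succ, owner, player, U):
    Z = set(U)
    while True:
        nxt = Z | cpre(succ, owner, player, Z)
        if nxt == Z:
            return Z
        Z = nxt


def safe_set(succ, owner, incomplete, player):
    """V minus the opponent's attractor to the opponent's incomplete vertices."""
    opponent = 1 - player
    escapes = {v for v in incomplete if owner[v] == opponent}
    return set(succ) - attr(succ, owner, opponent, escapes)


# Made-up game: a loops on itself; b and c form a cycle; c (ODD) is incomplete.
succ = {"a": {"a"}, "b": {"c"}, "c": {"b"}}
owner = {"a": EVEN, "b": EVEN, "c": ODD}
incomplete = {"c"}

safe_even = safe_set(succ, owner, incomplete, EVEN)
safe_odd = safe_set(succ, owner, incomplete, ODD)
```

Vertex `b` is unsafe for EVEN because its only move leads to `c`, from which ODD may still gain an escape; for ODD every vertex is safe, as EVEN owns no incomplete vertex.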


In the remainder of this section, we show that it is indeed the case that while
exploring a parity game, one can only safely determine the winners in the sets
$\mathsf{safe}_\Diamond(\Game)$ and $\mathsf{safe}_\Box(\Game)$, respectively. More specifically, we claim (Lemma 1) that
all $\alpha$-dominions found in $\mathsf{safe}_\alpha(\Game)$ are preserved in extensions of the game, and
(Lemma 2) that vertices not in $\mathsf{safe}_\alpha(\Game)$ are not necessarily won by the
same player in extensions of the game.


Lemma 1. Let $\Game$ and $\Game'$ be two incomplete games such that $\Game \sqsubseteq \Game'$. Any
$\alpha$-dominion in $\Game \cap \mathsf{safe}_\alpha(\Game)$ is also an $\alpha$-dominion in $\Game'$.


Example 8. Recall that in Example 7, we found that $\mathsf{safe}_\Diamond(\Game) = \{u_0, u_1, u_2, u_3\}$.
Observe that in the incomplete parity game $\Game$ of Example 6, restricted to vertices
$\{u_0, u_1, u_2, u_3\}$, all vertices are won by player $\Diamond$, and, hence, $\{u_0, u_1, u_2, u_3\}$ is
a $\Diamond$-dominion. Following Lemma 1 we can indeed conclude that this remains a
$\Diamond$-dominion in all extensions of $\Game$, and, in particular, for the (complete) parity
game $\Game'$ of Example 6.


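Lemma 2. Let $\Game$ be an incomplete parity game. Suppose that $W$ is an
$\alpha$-dominion in $\Game$. If $W \not\subseteq \mathsf{safe}_\alpha(\Game)$, then there is an (incomplete) parity game
$\Game'$ such that $\Game \sqsubseteq \Game'$ and all vertices in $W \setminus \mathsf{safe}_\alpha(\Game)$ are won by $\bar\alpha$.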


As a corollary of the above lemma, we find that $\alpha$-dominions that contain
vertices outside of the $\alpha$-safe set are not guaranteed to be dominions in all
extensions of the incomplete parity game.


Corollary 1. Let $\Game$ be an incomplete parity game. Suppose that $W$ is an
$\alpha$-dominion in $\Game$. If $W \not\subseteq \mathsf{safe}_\alpha(\Game)$, then there is an (incomplete) parity game $\Game'$
such that $\Game \sqsubseteq \Game'$ and $W$ is not an $\alpha$-dominion in $\Game'$.


The theorem below summarises the two previous results, claiming that the
sets $\mathsf{safe}_\Diamond(\Game)$ and $\mathsf{safe}_\Box(\Game)$ are the optimal subsets that can be used safely when
combining solving and the exploration of a parity game.
Theorem 1. Let $\Game = (G, I)$, with $G = (V, E, p, (V_\Diamond, V_\Box))$, be an incomplete
parity game. Define $W_\alpha$ as the union of all $\alpha$-dominions in $\Game \cap \mathsf{safe}_\alpha(\Game)$, and let
$W_{?} = V \setminus (W_\Diamond \cup W_\Box)$. Then $W_{?}$ is the largest set of vertices $v$ for which there
are incomplete parity games $\Game^{\alpha}$ and $\Game^{\bar\alpha}$ such that $\Game \sqsubseteq \Game^{\alpha}$ and $\Game \sqsubseteq \Game^{\bar\alpha}$ and $v$ is
won by $\alpha$ in $\Game^{\alpha}$ and $v$ is won by $\bar\alpha$ in $\Game^{\bar\alpha}$.


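Proof. Let $\Game = (G, I)$, with $G = (V, E, p, (V_\Diamond, V_\Box))$, be an incomplete parity game. Pick
a vertex $v \in W_{?}$. Suppose that in $G$, vertex $v$ is won by player $\alpha$. Let
$\Game^{\alpha} = \Game$. Then $\Game \sqsubseteq \Game^{\alpha}$ and $v$ is also won by $\alpha$ in $\Game^{\alpha}$.

Next, we argue that there must be a game $\Game^{\bar\alpha}$ such that $\Game \sqsubseteq \Game^{\bar\alpha}$ and $v$ is
won by $\bar\alpha$ in $\Game^{\bar\alpha}$. Since $v \in W_{?}$ is won by player $\alpha$ in $G$, $v$ must belong to an
$\alpha$-dominion in $G$. Towards a contradiction, assume that $v \in \mathsf{safe}_\alpha(\Game)$. Then there
must also be an $\alpha$-dominion containing $v$ in $G \cap \mathsf{safe}_\alpha(\Game)$, since $\bar\alpha$ cannot escape
the set $\mathsf{safe}_\alpha(\Game)$. But then $v \in W_\alpha$. Contradiction, so $v \notin \mathsf{safe}_\alpha(\Game)$. So, $v$ must
be part of an $\alpha$-dominion $D$ in $G$ such that $D \not\subseteq \mathsf{safe}_\alpha(\Game)$. By Lemma 2, we find
that there is an incomplete parity game $\Game^{\bar\alpha}$ such that $\Game \sqsubseteq \Game^{\bar\alpha}$ and all vertices in
$D \setminus \mathsf{safe}_\alpha(\Game)$, and vertex $v \in D$ in particular, are won by $\bar\alpha$ in $\Game^{\bar\alpha}$.

Finally, we argue that $W_{?}$ cannot be larger. Pick a vertex $v \notin W_{?}$. Then there
must be some player $\alpha$ such that $v \in W_\alpha$, and, consequently, there must be an
$\alpha$-dominion $D \subseteq \Game \cap \mathsf{safe}_\alpha(\Game)$ such that $v \in D$. But then by Lemma 1, we find
that $v$ is won by $\alpha$ in all incomplete parity games $\Game'$ such that $\Game \sqsubseteq \Game'$.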


4 On-the-fly Solving 


In the previous section we saw that for any solver $\mathsf{solve}_\alpha$, which accepts a parity
game as input and returns an $\alpha$-dominion $W_\alpha$, a correct on-the-fly solving
algorithm can be obtained by computing $W_\alpha = \mathsf{solve}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game))$ while exploring
an (incomplete) parity game $\Game$. While this approach is clearly sound, computing
the set of safe vertices can be expensive for large state spaces and potentially
wasteful when no dominions are found afterwards. We next introduce safe
attractors which, we show, can be used to search for specific dominions without
first computing the $\alpha$-safe set of vertices.


4.1 Safe Attractors 


We start by observing that the $\alpha$-attractor to a set $U$ in an incomplete parity
game $\Game$ does not make a distinction between the set of complete and incomplete
vertices. Consequently, it may wrongly conclude that $\alpha$ has a strategy to force
play to $U$ when the attractor strategy involves incomplete vertices owned by $\bar\alpha$.
We thus need to make sure that such vertices are excluded from consideration.
This can be achieved by considering the set of unsafe vertices $V_{\bar\alpha} \cap I$ as potential
vertices that can be used by the other player to escape. We define the safe
$\alpha$-attractor as the least fixed point of the safe control predecessor. The latter is
defined as follows:


$\mathsf{spre}_\alpha(\Game, U) = (V_\alpha \cap \mathsf{pre}(\Game, U)) \cup (V_{\bar\alpha} \setminus (\mathsf{pre}(\Game, V \setminus U) \cup \mathsf{sinks}(\Game) \cup I))$
Lemma 3. Let $\Game$ be an incomplete parity game. For all vertex sets $X \subseteq \mathsf{safe}_\alpha(\Game)$
it holds that $\mathsf{cpre}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), X) = \mathsf{spre}_\alpha(\Game, X)$.


The safe $\alpha$-attractor to $U$, denoted $\mathsf{SAttr}_\alpha(\Game, U)$, is the set of vertices from
which player $\alpha$ can force play to safely reach $U$ in $\Game$:


$\mathsf{SAttr}_\alpha(\Game, U) = \mu Z.(U \cup \mathsf{spre}_\alpha(\Game, Z))$


Lemma 4. Let $\Game$ be an incomplete parity game, and $X \subseteq \mathsf{safe}_\alpha(\Game)$. Then
$\mathsf{Attr}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), X) = \mathsf{SAttr}_\alpha(\Game, X)$.
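The relation stated in Lemma 4 can be observed on a small explicit game. The sketch below is a hedged illustration: the helper names (`spre`, `lfp_attr`, `safe_set`, `restrict`), the `safe` flag that switches between the ordinary and the safe control predecessor, and the four-vertex game are all our own assumptions and are not taken from the paper's symbolic implementation.

```python
EVEN, ODD = 0, 1  # hypothetical encodings of the two players


def pre(succ, U):
    return {v for v, ws in succ.items() if ws & U}


def spre(succ, owner, incomplete, player, U, safe=True):
    """Safe control predecessor; with safe=False this is the ordinary cpre."""
    V = set(succ)
    sinks = {v for v in V if not succ[v]}
    own = {v for v in V if owner[v] == player}
    blocked = pre(succ, V - U) | sinks | (set(incomplete) if safe else set())
    return (own & pre(succ, U)) | ((V - own) - blocked)


def lfp_attr(succ, owner, incomplete, player, U, safe=True):
    """SAttr (safe=True) or Attr (safe=False), as a least fixed point from U."""
    Z = set(U)
    while True:
        nxt = Z | spre(succ, owner, incomplete, player, Z, safe)
        if nxt == Z:
            return Z
        Z = nxt


def safe_set(succ, owner, incomplete, player):
    opponent = 1 - player
    escapes = {v for v in incomplete if owner[v] == opponent}
    return set(succ) - lfp_attr(succ, owner, incomplete, opponent, escapes,
                                safe=False)


def restrict(succ, U):
    return {v: succ[v] & U for v in U}


# Made-up game: b is an incomplete ODD vertex, c can only move to b.
succ = {"t": {"t"}, "a": {"t"}, "b": {"t"}, "c": {"b"}}
owner = {"t": EVEN, "a": ODD, "b": ODD, "c": EVEN}
incomplete = {"b"}

plain = lfp_attr(succ, owner, incomplete, EVEN, {"t"}, safe=False)
sattr = lfp_attr(succ, owner, incomplete, EVEN, {"t"})
S = safe_set(succ, owner, incomplete, EVEN)
attr_in_safe = lfp_attr(restrict(succ, S), owner, incomplete & S, EVEN,
                        {"t"}, safe=False)
```

The ordinary attractor optimistically includes `b` and `c`, while the safe attractor stops at `{"t", "a"}`, which coincides with the attractor computed inside the safe subgame.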


In particular, we can conclude the following: 


Corollary 2. Let $\Game$ be an incomplete parity game, and $X \subseteq \mathsf{safe}_\alpha(\Game)$ be an
$\alpha$-dominion. Then $\mathsf{SAttr}_\alpha(\Game, X)$ is an $\alpha$-dominion for all $\Game'$ satisfying $\Game \sqsubseteq \Game'$.


One application of the above corollary is the following: since on-the-fly solving is
typically performed repeatedly, previously found dominions can be expanded by
computing the safe $\alpha$-attractor towards these already solved vertices. Another
corollary is the following, which states that complete sinks can be safely attracted
towards.


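Corollary 3. Let $\Game = (G, I)$ be an incomplete parity game and let $\Game'$ be such
that $\Game \sqsubseteq \Game'$. Then $\mathsf{SAttr}_\alpha(\Game, \mathsf{sinks}_{\bar\alpha}(\Game) \setminus I)$ is an $\alpha$-dominion in $\Game'$.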


4.2 Partial Solvers 


In practice, a full-fledged solver, such as Zielonka's algorithm [31] or one of
the Priority Promotion variants [2], may be costly to run often while exploring
a parity game. Instead, cheaper partial solvers may be used that search for
a dominion of a particular shape. We study three such partial solvers in this
section, with a particular focus on solvers that lend themselves to parity games
that are represented symbolically using, e.g., BDDs [5], MDDs [25] or LDDs [13].
For the remainder of this section, we fix an arbitrary incomplete parity game
$\Game = ((V, E, p, (V_\Diamond, V_\Box)), I)$.


Winning solitaire cycles. A simple cycle in $\Game$ can be represented by a finite
sequence of distinct vertices $v_0 v_1 \ldots v_n$ satisfying $v_0 \in v_n E$. Such a cycle is an
$\alpha$-solitaire cycle whenever all vertices on that cycle are owned by player $\alpha$.
Observe that if all vertices on an $\alpha$-solitaire cycle have a priority that is of
the same parity as the owner $\alpha$, then all vertices on that cycle are won by player
$\alpha$. Formally, these are thus cycles through vertices in the set $P_\alpha \cap V_\alpha$, where
$P_\Diamond = \{v \in V \setminus \mathsf{sinks}(\Game) \mid p(v) \bmod 2 = 0\}$ and $P_\Box = \{v \in V \setminus \mathsf{sinks}(\Game) \mid p(v)
\bmod 2 = 1\}$. Let $C^{\alpha}_{sol}(\Game)$ represent the largest set of $\alpha$-solitaire winning cycles.
Then $C^{\alpha}_{sol}(\Game) = \nu Z.(P_\alpha \cap V_\alpha \cap \mathsf{pre}(\Game, Z))$.
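Proposition 1. The set $C^{\alpha}_{sol}(\Game)$ is an $\alpha$-dominion and we have
$C^{\alpha}_{sol}(\Game) \subseteq \mathsf{safe}_\alpha(\Game)$.


Proof. We first prove that $C^{\alpha}_{sol}(\Game) \subseteq \mathsf{safe}_\alpha(\Game)$. We show, by means of an induction
on the fixed point approximants $A_i$ of the attractor, that $C^{\alpha}_{sol}(\Game) \cap \mathsf{Attr}_{\bar\alpha}(\Game, V_{\bar\alpha} \cap
I) = \emptyset$. The base case follows immediately, as $C^{\alpha}_{sol}(\Game) \cap A_0 = C^{\alpha}_{sol}(\Game) \cap \emptyset = \emptyset$.
For the induction, we assume that $C^{\alpha}_{sol}(\Game) \cap A_i = \emptyset$; we show that also $C^{\alpha}_{sol}(\Game) \cap
((V_{\bar\alpha} \cap I) \cup \mathsf{cpre}_{\bar\alpha}(\Game, A_i)) = \emptyset$. First, observe that $C^{\alpha}_{sol}(\Game) \subseteq V_\alpha$; hence, it suffices
to prove that $C^{\alpha}_{sol}(\Game) \cap (V_\alpha \setminus (\mathsf{pre}(\Game, V \setminus A_i) \cup \mathsf{sinks}(\Game))) = \emptyset$. But this follows
immediately from the fact that for every vertex $v \in C^{\alpha}_{sol}(\Game)$, we have $v \in P_\alpha \cap
V_\alpha \cap \mathsf{pre}(\Game, C^{\alpha}_{sol}(\Game))$; more specifically, we have $vE \cap C^{\alpha}_{sol}(\Game) \neq \emptyset$ for all $v \in C^{\alpha}_{sol}(\Game)$.
The fact that $C^{\alpha}_{sol}(\Game)$ is an $\alpha$-dominion follows from the fact that for every
vertex $v \in C^{\alpha}_{sol}(\Game)$, there is some $w \in vE \cap C^{\alpha}_{sol}(\Game)$. This means that player $\alpha$
must have a strategy that is closed on $C^{\alpha}_{sol}(\Game)$. Since all vertices in $C^{\alpha}_{sol}(\Game)$ are of
the priority that is beneficial to $\alpha$, this closed strategy is also winning for $\alpha$.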


Observe that winning solitaire cycles can be computed without first computing
the $\alpha$-safe set. Parity games that stand to profit from detecting winning solitaire
cycles are those originating from verifying safety properties.
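The greatest fixed point that characterises winning solitaire cycles can be computed by starting from all vertices and shrinking. The sketch below uses explicit sets and hypothetical names (`solitaire_cycles`, `succ`, `owner`, `priority`); it relies on the encoding `EVEN = 0`, `ODD = 1`, so that a priority is beneficial to a player exactly when `priority % 2 == player`. Note that the set of incomplete vertices is not needed at all.

```python
EVEN, ODD = 0, 1  # hypothetical encodings; parity of a priority == player


def pre(succ, U):
    return {v for v, ws in succ.items() if ws & U}


def solitaire_cycles(succ, owner, priority, player):
    """Greatest fixed point of Z -> P_player & V_player & pre(Z)."""
    V = set(succ)
    candidates = {v for v in V
                  if succ[v]                        # not a sink
                  and owner[v] == player            # owned by the player
                  and priority[v] % 2 == player}    # beneficial priority
    Z = V
    while True:
        nxt = candidates & pre(succ, Z)
        if nxt == Z:
            return Z
        Z = nxt


# Made-up game: a and b form an even solitaire cycle; e has an even priority
# but only leads to c (odd priority), so it is pruned in a later iteration.
succ = {"a": {"b"}, "b": {"a"}, "c": {"a"}, "d": {"d"}, "e": {"c"}}
owner = {"a": EVEN, "b": EVEN, "c": EVEN, "d": ODD, "e": EVEN}
priority = {"a": 0, "b": 2, "c": 1, "d": 0, "e": 0}

even_cycles = solitaire_cycles(succ, owner, priority, EVEN)
odd_cycles = solitaire_cycles(succ, owner, priority, ODD)
```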


Winning forced cycles. In general, a cycle in $\mathsf{safe}_\Diamond(\Game)$ through vertices in $P_\Diamond$
can contain vertices of both players, providing player $\Box$ an opportunity to break
the cycle if that is beneficial to her. Nevertheless, if breaking a cycle always
inevitably leads to another cycle through $P_\Diamond$, then we may conclude that all
vertices on these cycles are won by player $\Diamond$. We call these cycles winning forced
cycles for player $\Diamond$. A dual argument applies to cycles through $P_\Box$. Let $C^{\alpha}_{for}(\Game)$
represent the largest set of vertices that are on winning forced cycles for player
$\alpha$. More formally, we define $C^{\alpha}_{for}(\Game) = \nu Z.(P_\alpha \cap \mathsf{safe}_\alpha(\Game) \cap \mathsf{cpre}_\alpha(\Game, Z))$.


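Lemma 5. The set $C^{\alpha}_{for}(\Game)$ is an $\alpha$-dominion and we have $C^{\alpha}_{for}(\Game) \subseteq \mathsf{safe}_\alpha(\Game)$.


A possible downside of the above construction is that it again requires first
computing $\mathsf{safe}_\alpha(\Game)$, which, in particular cases, may incur an additional overhead.
Instead, we can compute the same set using the safe control predecessor. We
define $C^{\alpha}_{s\text{-}for}(\Game) = \nu Z.(P_\alpha \cap \mathsf{spre}_\alpha(\Game, Z))$.


Proposition 2. We have $C^{\alpha}_{for}(\Game) = C^{\alpha}_{s\text{-}for}(\Game)$.


Proof. Let $\tau(Z) = P_\alpha \cap \mathsf{spre}_\alpha(\Game, Z)$. We use set inclusion to show that $C^{\alpha}_{for}(\Game)$ is
indeed a fixed point of $\tau$.


- ad $C^{\alpha}_{for}(\Game) \subseteq \tau(C^{\alpha}_{for}(\Game))$. Pick a vertex $v \in C^{\alpha}_{for}(\Game)$. By definition of $C^{\alpha}_{for}(\Game)$,
  we have $v \in P_\alpha \cap \mathsf{safe}_\alpha(\Game) \cap \mathsf{cpre}_\alpha(\Game, C^{\alpha}_{for}(\Game))$. Observe that $\mathsf{safe}_\alpha(\Game) \cap
  \mathsf{cpre}_\alpha(\Game, C^{\alpha}_{for}(\Game)) = \mathsf{safe}_\alpha(\Game) \cap \mathsf{cpre}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), C^{\alpha}_{for}(\Game))$. But then, since
  $C^{\alpha}_{for}(\Game) \subseteq \mathsf{safe}_\alpha(\Game)$, we find, by Lemma 3, that $\mathsf{cpre}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), C^{\alpha}_{for}(\Game)) =
  \mathsf{spre}_\alpha(\Game, C^{\alpha}_{for}(\Game))$. Hence, $v \in P_\alpha \cap \mathsf{spre}_\alpha(\Game, C^{\alpha}_{for}(\Game)) = \tau(C^{\alpha}_{for}(\Game))$.

- ad $C^{\alpha}_{for}(\Game) \supseteq \tau(C^{\alpha}_{for}(\Game))$. Again pick a vertex $v \in \tau(C^{\alpha}_{for}(\Game))$. Then $v \in
  P_\alpha \cap \mathsf{spre}_\alpha(\Game, C^{\alpha}_{for}(\Game))$. Since $C^{\alpha}_{for}(\Game) \subseteq \mathsf{safe}_\alpha(\Game)$, by Lemma 3, we again have
  $\mathsf{spre}_\alpha(\Game, C^{\alpha}_{for}(\Game)) = \mathsf{cpre}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), C^{\alpha}_{for}(\Game))$. But then it must be the case
  that $v \in \mathsf{safe}_\alpha(\Game)$. Moreover, $\mathsf{cpre}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), C^{\alpha}_{for}(\Game)) \subseteq \mathsf{cpre}_\alpha(\Game, C^{\alpha}_{for}(\Game))$.
  So $v \in P_\alpha \cap \mathsf{safe}_\alpha(\Game) \cap \mathsf{cpre}_\alpha(\Game, C^{\alpha}_{for}(\Game)) = C^{\alpha}_{for}(\Game)$.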
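We show next that for any $Z = \tau(Z)$, we have $Z \subseteq C^{\alpha}_{for}(\Game)$. Let $Z$ be such. We first
show that for every $v \in Z \cap V_\alpha$, there is some $w \in vE \cap Z$, and for every $v \in Z \cap V_{\bar\alpha}$,
we have $v \notin \mathsf{sinks}(\Game)$, $v \notin I$ and $vE \subseteq Z$. Pick $v \in Z \cap V_\alpha$. Then $v \in \tau(Z) \cap V_\alpha =
P_\alpha \cap V_\alpha \cap \mathsf{spre}_\alpha(\Game, Z) \subseteq \mathsf{pre}(\Game, Z)$. But then $vE \cap Z \neq \emptyset$. Next, let $v \in Z \cap V_{\bar\alpha}$.
Then $v \in \tau(Z) \cap V_{\bar\alpha} = P_\alpha \cap V_{\bar\alpha} \cap \mathsf{spre}_\alpha(\Game, Z) \subseteq V_{\bar\alpha} \setminus (\mathsf{pre}(\Game, V \setminus Z) \cup \mathsf{sinks}(\Game) \cup I)$.
So $v \notin \mathsf{pre}(\Game, V \setminus Z) \cup \mathsf{sinks}(\Game) \cup I$. Consequently, $vE \subseteq Z$, $v \notin \mathsf{sinks}(\Game)$ and
$v \notin I$.

Since for every $v \in Z \cap V_\alpha$, we have $vE \cap Z \neq \emptyset$, there must be a strategy
for player $\alpha$ to move to another vertex in $Z$. Let $\sigma$ be this strategy. Moreover,
since for all $v \in Z \cap V_{\bar\alpha}$ we have $vE \subseteq Z$, we find that $\sigma$ is closed on $Z$ and since
$Z \cap \mathsf{sinks}(\Game) = \emptyset$, strategy $\sigma$ induces forced cycles. Moreover, since $Z \subseteq P_\alpha$, we
can conclude that all vertices in $Z$ are on winning forced cycles.

Finally, we must argue that $Z \subseteq \mathsf{safe}_\alpha(\Game)$. But this follows from the fact that
$Z \cap V_{\bar\alpha} \cap I = \emptyset$, and, hence, also $Z \cap \mathsf{Attr}_{\bar\alpha}(\Game, V_{\bar\alpha} \cap I) = \emptyset$. Since $Z$ is contained
within $P_\alpha \cap \mathsf{safe}_\alpha(\Game)$, we find that $Z \subseteq C^{\alpha}_{for}(\Game)$.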
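The fixed point $\nu Z.(P_\alpha \cap \mathsf{spre}_\alpha(\Game, Z))$ used for forced cycles can be evaluated directly on the incomplete game, without computing the safe set first. The following sketch does this on explicit sets; the function names and the three-vertex game are hypothetical and only serve to show how marking an opponent vertex as incomplete dissolves an otherwise winning forced cycle.

```python
EVEN, ODD = 0, 1  # hypothetical encodings; parity of a priority == player


def pre(succ, U):
    return {v for v, ws in succ.items() if ws & U}


def spre(succ, owner, incomplete, player, U):
    """Safe control predecessor of U for `player`."""
    V = set(succ)
    sinks = {v for v in V if not succ[v]}
    own = {v for v in V if owner[v] == player}
    blocked = pre(succ, V - U) | sinks | set(incomplete)
    return (own & pre(succ, U)) | ((V - own) - blocked)


def safe_forced_cycles(succ, owner, priority, incomplete, player):
    """Greatest fixed point of Z -> P_player & spre_player(Z)."""
    V = set(succ)
    P = {v for v in V if succ[v] and priority[v] % 2 == player}
    Z = V
    while True:
        nxt = P & spre(succ, owner, incomplete, player, Z)
        if nxt == Z:
            return Z
        Z = nxt


# Made-up game: ODD vertex b can choose between a and c, but both choices
# stay on cycles through even priorities.
succ = {"a": {"b"}, "b": {"a", "c"}, "c": {"a"}}
owner = {"a": EVEN, "b": ODD, "c": EVEN}
priority = {"a": 0, "b": 2, "c": 0}

complete_result = safe_forced_cycles(succ, owner, priority, set(), EVEN)
incomplete_result = safe_forced_cycles(succ, owner, priority, {"b"}, EVEN)
```

When `b` is complete, all three vertices lie on winning forced cycles; once `b` is marked incomplete, the iteration shrinks to the empty set, since `b` may still obtain an escape.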


Fatal attractors. Both solitaire cycles and forced cycles utilise the fact that the
parity winning condition becomes trivial if the only priorities that occur on
a play are of the parity of a single player. Fatal attractors [17] were originally
conceived to solve parts of a game using algorithms that have an appealing
worst-case running time; for a detailed account, we refer to [17]. While ibid. investigates
several variants, the main idea behind a fatal attractor is that it identifies cycles
in which the priorities are non-decreasing until the dominating priority of the
attractor is (re)visited. We focus on a simplified (and cheaper) variant of the
psolB algorithm of [17], which is based on the concept of a monotone attractor,
which, in turn, relies on the monotone control predecessor defined below, where
$P^{\geq c} = \{v \in V \mid p(v) \geq c\}$:


$\mathsf{Mcpre}_\alpha(\Game, Z, U, c) = P^{\geq c} \cap \mathsf{cpre}_\alpha(\Game, Z \cup U)$


The monotone attractor for a given priority is then defined as the least fixed point
of the monotone control predecessor for that priority, formally $\mathsf{MAttr}_\alpha(\Game, U, c) =
\mu Z.\mathsf{Mcpre}_\alpha(\Game, Z, U, c)$. A fatal attractor for priority $c$ is then the largest set of
vertices closed under the monotone attractor for priority $c$; i.e., $F^{\alpha}(\Game, c) =
\nu Z.(P^{=c} \cap \mathsf{safe}_\alpha(\Game) \cap \mathsf{MAttr}_\alpha(\Game \cap \mathsf{safe}_\alpha(\Game), Z, c))$, where $P^{=c} = P^{\geq c} \setminus P^{\geq c+1}$.


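Lemma 6 (See [17], Theorem 2). For even $c$, we have that $\mathsf{MAttr}_\Diamond(\Game \cap
\mathsf{safe}_\Diamond(\Game), F^{\Diamond}(\Game, c), c) \subseteq \mathsf{safe}_\Diamond(\Game)$ and $\mathsf{MAttr}_\Diamond(\Game \cap \mathsf{safe}_\Diamond(\Game), F^{\Diamond}(\Game, c), c)$ is a
$\Diamond$-dominion. If $c$ is odd then we have $\mathsf{MAttr}_\Box(\Game \cap \mathsf{safe}_\Box(\Game), F^{\Box}(\Game, c), c) \subseteq \mathsf{safe}_\Box(\Game)$
and $\mathsf{MAttr}_\Box(\Game \cap \mathsf{safe}_\Box(\Game), F^{\Box}(\Game, c), c)$ is a $\Box$-dominion.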


Our simplified version of the psolB algorithm, here dubbed $\mathrm{solB}^{-}$, computes
fatal attractors for all priorities in descending order, accumulating $\Diamond$- and
$\Box$-dominions and extending these dominions using a standard $\Diamond$- or $\Box$-attractor.
This can be implemented using a simple loop over these priorities.

In line with the previous solvers, we can also modify this solver to employ 
a safe monotone control predecessor, which uses a construction that is similar 
in spirit to that of the safe control predecessor. Formally, we define the safe
monotone control predecessor as follows: 


$\mathrm{sMcpre}_\alpha(\Game, Z, U, c) = P_{\geq c} \cap \mathrm{spre}_\alpha(\Game, Z \cup U)$


The corresponding safe monotone $\alpha$-attractor, denoted $\mathrm{sMAttr}_\alpha(\Game, U, c)$, is defined
as follows: $\mathrm{sMAttr}_\alpha(\Game, U, c) = \mu Z.\mathrm{sMcpre}_\alpha(\Game, Z, U, c)$. We define the safe
fatal attractor for priority $c$ as the set $F^\alpha_s(\Game, c) = \nu Z.(P_{=c} \cap \mathrm{sMAttr}_\alpha(\Game, Z, c))$.


Proposition 3. Let $\Game$ be an incomplete parity game. We have $F^\Diamond_s(\Game, c) =
F^\Diamond(\Game, c)$ for even $c$ and for odd $c$ we have $F^\Box_s(\Game, c) = F^\Box(\Game, c)$.


Similar to algorithm $\mathrm{solB}^{-}$, the algorithm $\mathrm{solB}^{-}_{s}$ computes safe fatal
attractors for priorities in descending order and collects the safe-$\alpha$-attractor extended
dominions obtained this way.


5 Experimental Results 


We experimentally evaluate the techniques of Section 4. For this, we use games 
stemming from practical model checking and equivalence checking problems. 
Our experiments are run, single-threaded, on an Intel Xeon 6136 CPU @ 3 GHz 
PC. The sources for these experiments can be obtained from the downloadable 
artefact [21]. 


5.1 Implementation 


We have implemented a symbolic exploration technique for parity games in the 
mCRL2 toolset [6]. Our tool exploits techniques such as read and write depen- 
dencies [20,4], and uses sophisticated exploration strategies such as chaining and 
saturation [9]. We use MDD-like data structures [25] called List Decision Dia- 
grams (LDDs), and the corresponding Sylvan implementation [13], to represent 
parity games symbolically. Sylvan also offers efficient implementations for set 
operations and relational operations, such as predecessors, facilitating the im- 
plementation of attractor computations, the described (partial) solvers, and a 
full solver based on Zielonka’s recursive algorithm [31], which remains one of the 
most competitive algorithms in practice, both explicitly and symbolically [28,12]. 
For the attractor set computation we have also implemented chaining to deter- 
mine (multi-)step a-predecessors more efficiently. 

For all three on-the-fly solving techniques of Section 4, we have implemented 
1) a variant that runs the standard (partial) solver on the a-safe subgame and 
removes the found dominion using the standard attractor (within that subgame), 
and 2) a variant that uses (partial) solvers with the safe attractors. Moreover, 
we also conduct experiments using the full solver running on an a-safe subgame. 
An important design aspect is to decide how the exploration and the on-the-fly 
solving should interleave. For this we have implemented a time based heuristic 
that keeps track of the time spent on solving and exploration steps. The time 
measurements are used to ensure that (approximately) ten percent of total time
is spent on solving by delaying the next call to the solver. We do not interrupt
the partial solver when it exceeds this budget, so the ten percent target is only met approximately.
As a result of this heuristic, cheap solvers will be called more frequently than 
more expensive (and more powerful) ones, which may cause the latter to explore 
larger parts of the game graph. 
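The time-based heuristic can be sketched as a small bookkeeping class. This is a minimal illustration, assuming that the caller measures the duration of each exploration and solving step itself; the class name, method names and the decision rule shown here are hypothetical and are not taken from the mCRL2 sources. Only the ten percent target comes from the text.

```python
class SolveScheduler:
    """Decides when to call the partial solver so that solving stays near a time budget."""

    def __init__(self, target_fraction=0.1):
        self.target_fraction = target_fraction  # share of total time intended for solving
        self.explore_time = 0.0
        self.solve_time = 0.0

    def record_explore(self, seconds):
        self.explore_time += seconds

    def record_solve(self, seconds):
        self.solve_time += seconds

    def should_solve(self):
        """True once the time spent on solving has dropped to the target share again."""
        total = self.explore_time + self.solve_time
        return total > 0 and self.solve_time <= self.target_fraction * total


scheduler = SolveScheduler()
scheduler.record_explore(9.0)
first = scheduler.should_solve()     # nothing spent on solving yet
scheduler.record_solve(5.0)          # an expensive solver call overshoots the budget
second = scheduler.should_solve()    # so the next call is delayed
scheduler.record_explore(91.0)
third = scheduler.should_solve()     # enough exploration has happened again
```

Because a running solver call is never interrupted, a single expensive call can exceed the budget; the rule only delays the next call, which matches the observation that expensive solvers are invoked less often.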


5.2 Cases 


Table 1 provides an overview of the models and a description of the property
that is being checked. The properties are written in the modal μ-calculus with
data [15]. For the equivalence checking case we have mutated the original model
to introduce a defect. For each property, we indicate the nesting depth (ND) and
alternation depth (AD) [10] and whether the parity game is solitaire (Yes/No). The
nesting depth indicates how many different priorities occur in the resulting game; 
for our encoding this is at most ND+2 (the additional ones encode constants 
‘true’ and ‘false’). The alternation depth is an indication of a game's complexity 
due to alternating priorities. 


Table 1. Models and formulas. 


Model  Ref.  Prop.  Result  ND  AD  Sol.  Description
SWP    [30]  1      false   1       Y     No error transition
             2      false   3   3   N     Infinitely often enabled then infinitely often taken
WMS    [27]  1      false   1       Y     Job failed to be done
             2      false   1       Y     No zombie jobs
             3      true    3   2   Y     A job can become alive again infinitely often
             4      false   2   2   N     Branching bisimulation with a mutation
BKE    [3]   1      true    1       Y     No secret leaked
             2      false   2       N     No deadlock
CCP    [26]  1      false   2       N     No deadlock
             2      false   2       N     After access there is always accessover possible
PDI    n/a   1      true    2       N     Controller reaches state before it can connect again
             2      false   2       N     Connection impermissible can always happen or we
                                          establish a connection
             3      false   3       N     When connected move to not ready for connection and
                                          do not establish a connection until it is allowed again
             4      true    2       N     The interlocking moves to the state connection closed
                                          before it is allowed to successfully establish a connection


We use MODEL-i to indicate the parity game belonging to model MODEL
and property i. Models SWP, BKE and CCP are protocol specifications. The
model PDI is a specification of a EULYNX SCI-LX SysML interface model that
is used for a train interlocking system. Finally, WMS is the specification of a
workload management system used at CERN. Using tools in mCRL2 [6], we have
converted each model and property combination into a so-called parameterised
Boolean equation system [16], a higher-level logic that can be used to represent
the underlying parity game.

Parity games SWP-1, WMS-1, WMS-2 and BKE-1 encode typical safety
properties where some action should not be possible. In terms of the alternation-free
modal mu-calculus with regular expressions, such properties are of the shape
[true*.a]false. These properties are violated exactly when the vertex encoding
‘false’ can be reached. Parity games SWP-2, WMS-3 and WMS-4 are more 
complex properties with alternating priorities, where WMS-4 encodes branching 
bisimulation using the theory presented in [8]. The parity games BKE-2 and 
CCP-1 encode a ‘no deadlock’ property given by a formula which states that 
after every path there is at least one outgoing transition. Finally, CCP-2 and 
all PDI cases contain formulas with multiple fixed points that yield games with 
multiple priorities but no (dependent) alternation. 


Table 2. Experiments with parity games where on-the-fly solving cannot terminate
early. All run times are in seconds. The number of vertices is given in millions. Memory
is given in gigabytes. Bold-faced numbers indicate the lowest value.


Game   Strategy   Vertices (10^6)  Explore (s)  Solve (s)  Total (s)  Mem (GB)
BKE-1  full       40               640          65         705        14
       solitaire  40/40            629/615      153/100    782/715    15/15
       cycles     40/40            635/644      149/160    785/804    15/15
       fatal      40/40            624/625      152/164    776/789    15/15
       partial    40               651          147        798        15
PDI-1  full       114              27           0.1        28         2
       solitaire  114/114          28/27        4/0        33/28      2/2
       cycles     114/114          29/28        7/7        36/35      2/2
       fatal      114/114          28/28        4/7        32/35      2/2
       partial    114              28           9          37         2
PDI-4  full       474              286          0          287        2
       solitaire  474/474          284/281      46/14      331/295    2/2
       cycles     474/474          284/287      92/91      376/378    2/2
       fatal      474/474          285/283      80/91      365/374    2/2
       partial    474              286          64         350        2

5.3 Results 


In Tables 2 and 3 we compare the on-the-fly solving strategies presented in 
Section 4. In the ‘Strategy’ column we indicate the on-the-fly solving strategy 
that is used. Here full refers to a complete exploration followed by solving with 
the Zielonka recursive algorithm. We use solitaire to refer to solitaire winning 
cycle detection, cycles for forced winning cycle detection, fatal to refer to fatal 
attractors and finally partial for on-the-fly solving with a Zielonka solver on safe 
regions. For solvers with a standard variant and a variant that utilises the safe 
attractors the first number indicates the result of applying the (standard) solver 
on safe vertices, and the second number (following the slash ‘/’) indicates the 
result when using the solver that utilises safe attractors. 

The column ‘Vertices’ indicates the number of vertices explored in the game.
In the next columns we indicate the time spent on exploring and solving specifically
and the total time in seconds. We exclude the initialisation time that is
common to all experiments. Finally, the last column indicates memory used by
the tool in gigabytes. We report the average of 5 runs and have set a timeout
(indicated by †) at 1200 seconds per run. Table 2 contains all benchmarks that
require a full exploration of the game graph, providing an indication of the overhead
in cases where this is unavoidable; Table 3 contains all benchmarks where
at least one of the partial solvers allows exploration to terminate early.


Table 3. Experiments with parity games in which at least one partial solver terminates
early. All run times are in seconds. The number of vertices is given in millions. For
solvers with two variants the first number indicates the result of applying the solver
on safe vertices, and following the slash ‘/’ the result when using the solver that uses
safe attractors. Memory is given in gigabytes. Bold-faced numbers indicate the lowest
value.


Game   Strategy   Vertices (10^6)  Explore (s)  Solve (s)  Total (s)  Mem (GB)
SWP-1  full       13304                         n/a        †
       solitaire  15.1/0.4                      27.3/0.1   35.8/1.5   2.8/1.5
       cycles     25.2/0.9                      42.7/1.0   55.0/2.8   3.2/1.5
       fatal      15.1/0.4                      29.4/0.4   38.4/1.7   3.1/1.5
       partial    27.1                                     63.5       3.6
SWP-2  full       1987                                     †          †
       solitaire  1631/1987                                †/†        †/†
       cycles     1774/1774                                †/†        †/†
       fatal      0.007/0.007                              1.3/1.0    1.4/1.2
       partial    0.007                                    1.3        1.4
WMS-1  full       270                                      3.3        0.2
       solitaire  270/240                                  3.6/2.9    0.3/0.2
       cycles     270/270                                  3.7/11.2   0.3/0.5
       fatal      270/270                                  3.4/11.7   0.3/0.5
       partial    270                                      3.5        0.3
WMS-2  full       317                                      3.6        0.2
       solitaire  7/7                                      1.2/0.8    0.1/0.1
       cycles     7/66                                     1.2/3.4    0.1/0.2
       fatal      7/66                                     1.3/3.6    0.1/0.2
       partial    7                                        1.3        0.1
WMS-3  full       317                                      2.7        0.2
       solitaire  317/317                                  3.1/2.9    0.2/0.2
       cycles     317/317                                  3.1/3.3    0.2/0.2
       fatal      5/1                                      0.7/0.2    0.1/0.1
       partial                                             0.5        0.1
WMS-4  full                                                †          †
       solitaire                                           39/38      2/2
       cycles                                              38/37      2/2
       fatal                                               38/37      2/2
       partial                                             38         2
BKE-2  full                                                979        28
       solitaire                                           0.2/0.2    0.9/0.9
       cycles                                              0.2/0.2    0.9/0.9
       fatal                                               0.2/0.2    0.9/0.9
       partial                                             0.2        0.9
CCP-1  full                                                32         2
       solitaire                                           1.1/1.1    2/2
       cycles                                              1.1/1.1    2/2
       fatal                                               1.4/1.2    1.5/1.5
       partial                                             1.1        1.5
CCP-2  full                                                68         1.7
       solitaire                                           1.8/1.1    1.5/1.5
       cycles                                              24/12      1.5/1.5
       fatal                                               18/13      1.5/1.5
       partial                                             1.8        1.5
PDI-2  full                                                43         2
       solitaire                                           67/45      2/2
       cycles                                              17/19      2/2
       fatal                                               18/19      2/2
       partial                                             51         2
PDI-3  full       436              228                     236        2
       solitaire  436/436          230/228                 266/260    2/2
       cycles     78/162           65/102                  84/166     2/2
       fatal      75/84            64/67                   83/90      2/2
       partial    110              82                      112        2


For games SWP-1, WMS-1, WMS-2 in Table 3 we find that solitaire, and in 
particular the safe attractor variant, is able to determine the solution the fastest. 
Also, for all entries in Table 2 this is the solver with the least overhead. Next, we 
observe that for cases such as WMS-1 and PDI-3 using the safe attractor variants 
of the solvers can be detrimental. Our observation is that first computing safe
sets (especially using chaining) can be quick when most vertices are owned by
one player and have one priority, whereas the computation of the safe attractor, which
uses the more involved safe control predecessor, is more expensive in such cases. There
are also cases WMS-3, WMS-4, CCP-1 and CCP-2 where the safe attractor 
variants are faster and these cases all have multiple priorities. In cases where 
these solvers are slow (for example PDI-3) we also observe that more states are 
explored before termination, because the earlier mentioned time based heuristic 
results in calling the solver significantly less frequently. 

For parity games SWP-2 and WMS-3 only fatal and partial are able to find 
a solution early, which shows that more powerful partial solvers can be useful. 
From Table 2 and the cases in which the safe attractor variants perform poorly 
we learn that the partial solvers can, as expected, cause overhead. This overhead
is on average 30 percent in our benchmarks, but when a partial solver allows early
termination it can be very beneficial, achieving speed-ups of up to several orders of magnitude.


6 Conclusion 


In this work we have developed the theory to reason about on-the-fly solving 
of parity games, independent of the strategy that is used to explore games. We 
have introduced the notion of safe vertices, shown their correctness, proven an
optimality result, and we have studied partial solvers and shown that these can
be made to run without determining the safe vertices first, which can be useful
for on-the-fly solving. Finally, we have demonstrated the practical purpose of our
method and observed that solitaire winning cycle detection with safe attractors 
is almost always beneficial with minimal overhead, but also that more powerful 
partial solvers can be useful. 

Based on our experiments, one can make an educated guess which partial 
solver to select in particular cases; we believe that this selection could even be 
steered by analysing the parameterised Boolean equation system representing the 
parity game. It would furthermore be interesting to study (practical) improve- 
ments for the safe attractors, and their use in Zielonka’s recursive algorithm. 
Abstract. Partition refinement is a method for minimizing automata
and transition systems of various types. Recently we have developed a 
partition refinement algorithm and the tool CoPaR that is generic in the 
transition type of the input system and matches the theoretical run time 
of the best known algorithms for many concrete system types. Genericity 
is achieved by modelling transition types as functors on sets and systems 
as coalgebras. Experimentation has shown that memory consumption is 
a bottleneck for handling systems with a large state space, while running 
times are fast. We have therefore extended an algorithm due to Blom
and Orzan, which is suitable for a distributed implementation, to the
coalgebraic level of genericity, and implemented it in CoPaR. Experiments
show that this allows handling much larger state spaces. Running times
are low in most experiments, but there is a significant penalty for some.


1 Introduction 


Minimization is an important and basic algorithmic task on state-based systems, 
concerned with reducing the state space as much as possible while retaining 
the system’s behaviour. It is used for equivalence checking of systems and as a 
subtask in model checking tools in order to handle larger state spaces and thus 
mitigate the state-explosion problem. 

We focus on the task of identifying behaviourally equivalent states modulo
bisimilarity. For classic labelled transition systems this notion obeys the principle
‘states $s$ and $t$ are bisimilar if for every transition $s \to s'$, there exists a transition
$t \to t'$ with $s'$ and $t'$ bisimilar’, and symmetrically for transitions from $t$.
Bisimilarity is a rather fine-grained branching-time notion of equivalence (cf. [17]);
it is widely used and preserves all properties expressible as μ-calculus formulas.
Moreover, it has been generalized to yield equivalence notions for many other 
types of state-based systems and automata. 

Due to the above principle, bisimilarity is defined by a fixed point, to be
understood as a greatest fixed point, and is hence approximable from above.
This is used by partition refinement algorithms: The initial partition considers
all states tentatively equivalent and is then iteratively refined using observations
about the states until a fixed point is reached. Consequently, such procedures
run in polynomial time and can also be efficiently implemented, in contrast to 
coarser system equivalences such as trace equivalence and language equivalence 
of nondeterministic systems which are PSPACE-complete [23]. This makes mini- 
mization under bisimilarity interesting even in cases where the main equivalence 
is linear-time, such as for automata. 
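The refinement loop described above can be sketched for finite labelled transition systems as follows. This is a plain illustration of the naive refinement idea rather than any of the optimized algorithms discussed in this paper; the function name `refine` and the dictionary representation (each state mapped to its set of `(label, successor)` pairs, with every successor also present as a key) are assumptions made for the example.

```python
def refine(transitions):
    """Return a map from states to block ids; states share a block iff they are bisimilar."""
    states = list(transitions)
    block = {s: 0 for s in states}  # initially all states are tentatively equivalent
    num_blocks = 1
    while True:
        ids = {}
        new_block = {}
        for s in states:
            # Observation about s: which blocks can be reached under which label.
            signature = frozenset((a, block[t]) for a, t in transitions[s])
            # Keeping the old block in the key guarantees that blocks are only ever split.
            key = (block[s], signature)
            new_block[s] = ids.setdefault(key, len(ids))
        if len(ids) == num_blocks:
            return new_block  # no block was split: fixed point reached
        block, num_blocks = new_block, len(ids)


# p and q loop between each other, r is a deadlock; s -> t -> u is a chain into a deadlock.
lts = {
    "p": {("a", "q")},
    "q": {("a", "p")},
    "r": set(),
    "s": {("a", "t")},
    "t": {("a", "u")},
    "u": set(),
}
blocks = refine(lts)
```

Each round can only split blocks, so the loop runs at most as many rounds as there are states before the number of blocks stops growing.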


Efficient partition refinement algorithms exist for various systems: Kanellakis
and Smolka provide a minimization algorithm with run time $O(m \cdot n)$ for labelled
transition systems with $n$ states and $m$ transitions. Even faster algorithms have
been developed over the past 50 years for many types of systems. For example,
Hopcroft's algorithm for minimizing deterministic automata has run time in
$O(n \cdot \log n)$ [21]; it was later generalized to variable input alphabets, with run time
$O(n \cdot |A| \cdot \log n)$ [18,24]. The Paige-Tarjan algorithm minimizes transition systems
in time $O((m + n) \cdot \log n)$ [31], and generalizations to labelled transition systems
have the same time complexity [13, 22, 36]. For the minimization of weighted
systems (a.k.a. lumping), Valmari and Franceschinis [38] have developed a simple
$O((m + n) \cdot \log n)$ algorithm for systems with rational weights. Buchholz [10] gave
an algorithm for weighted automata, and Högberg et al. [20] one for (bottom-up)
weighted tree automata, both with run time in $O(m \cdot n)$.


In previous work [16,42], we have provided an efficient partition refinement
algorithm, which is generic in the system type, captures all the above system
types, and matches or, in some cases, even improves on the run time complexity of
the respective specialized algorithms. Subsequently, we have shown how to extend
the generic complexity analysis to weighted tree automata and implemented the
algorithm in the tool CoPaR [11,41], again matching the previous best run time
complexity and improving it in the case of weighted tree automata with weights
from a non-cancellative monoid. The algorithm is based on ideas of Paige and
Tarjan, which leads to its efficiency. Genericity is achieved by modelling state-based
systems as coalgebras, following the paradigm of universal coalgebra [34],
in which the transition structure of systems is encapsulated by a set functor.
The algorithm and tool are modular in the sense that functors can be built 
from a preimplemented set of basic functors by standard set constructions such 
as cartesian product, disjoint union and functor composition. The tool then 
automatically derives a parser for input coalgebras of the composed type and 
provides a corresponding partition refinement implementation off the shelf. In 
addition, new basic functors F may easily be added to the set of basic functors by 
implementing a simple refinement interface for them plus a parser for encoded F- 
coalgebras. Our experiments with the tool have shown that run time scales 
well with the size of systems. However, memory usage becomes a bottleneck 
with growing system size, a problem that has previously also been observed by 
Valmari [37] for partition refinement. One strategy to address this is to distribute 
the algorithm across multiple computers, which store and process only a part 
of the state space and communicate via message passing. For ordinary labelled 
transition systems and Markov systems this has been investigated in a series 
of papers by Blom and Orzan [4-9] who were also motivated to mitigate the 
memory bottleneck of sequential partition refinement algorithms. 
Our contribution in this paper is an extension of CoPaR by an efficient distributed
partition algorithm in coalgebraic generality. Like in Blom and Orzan's
work, our algorithm is a distributed version of a simple but effective algorithm
called “the naive method” [23], or “the final chain algorithm” in coalgebraic
generality [25, 42]. We first generalize signature refinement introduced by Blom
and Orzan to the level of coalgebras. We also combine generalized signatures
(Section 3) with the previous encodings of set functors and their coalgebras [11, 41] via
the new notion of a signature interface (Definition 3.1). This is a key idea to make
coalgebraic signature refinement and the final chain algorithm implementable in a
tool like CoPaR. In addition, we demonstrate how signature interfaces of functors
can be combined (Construction 3.3 and Proposition 3.4) along standard functor
constructions. This yields a modularity principle similar to that of the previous
sequential algorithm. However, this is a new feature for signature refinement
and also, to our knowledge, for the final chain algorithm. Consequently, our
distributed, modular and generic implementation of the final chain algorithm is
new (already as a sequential algorithm).

We also provide experiments demonstrating its scalability and show that much 
larger state spaces can indeed be handled. Our benchmarks include weighted tree 
automata for non-cancellative monoids, a type of system for which our previous 
sequential implementation is heavily limited by its memory requirements. For
those systems the running times of the distributed algorithm are even faster than
those of the sequential algorithm. In a second set of benchmarks stemming from
the PRISM benchmark suite [27] we again show that larger systems can now be 
handled; however, for some of these there is a penalty in run time. 


Related work. Balcázar et al. [1] have proved that the problem of bisimilarity
checking for labelled transition systems is P-complete, which implies that it is 
hard to parallelize efficiently. Nevertheless, parallel algorithms have been proposed 
by Rajasekaran and Lee [33]. These are designed for shared memory machines 
and hence do not distribute RAM requirements over multiple machines. 

Symbolic techniques are an orthogonal approach to reduce memory usage of
partition refinement algorithms and have been explored e.g. by Wimmer et al. [40]
and van Dijk and van de Pol [15].

Two other orthogonal extensions of the generic coalgebraic minimization and
CoPaR have been presented in recent work. First, a non-trivial extension computes
(1) reachable states and (2) the transition structure of the minimized systems [12].
Second, Wißmann et al. [43] have shown how to compute distinguishing formulas
in a Hennessy-Milner style logic for a pair of behaviourally inequivalent states.


2 Preliminaries 


Our algorithmic framework and the tool CoPaR [41,42] are based on modelling 
state-based systems abstractly as coalgebras for a (set) functor that encapsulates 
the transition type, following the paradigm of universal coalgebra [34]. We now 
recall some standard notations for sets and maps and basic notions and examples 
in coalgebra. We fix a singleton set $1 = \{*\}$; for every set $X$ we have a unique
map $!\colon X \to 1$ and the identity map $\mathrm{id}_X\colon X \to X$. We denote composition of
maps by $(-) \cdot (-)$, in applicative order. Given maps $f\colon X \to A$, $g\colon X \to B$ we
define $\langle f, g\rangle\colon X \to A \times B$ by $\langle f, g\rangle(x) = (f(x), g(x))$. The type of transitions of
states in a system is modelled by a set functor $F$. Informally, $F$ assigns to every
set $X$ a set $FX$ of structured collections of elements of $X$, and an $F$-coalgebra is
a map $c\colon S \to FS$ which assigns to every state $s \in S$ in a system a structured
collection $c(s) \in FS$ of successor states of $s$. The functor $F$ also determines a
canonical notion of behavioural equivalence of states of a coalgebra; this arises
by stipulating that morphisms of coalgebras are behaviour preserving maps.


Definition 2.1. A functor $F\colon \mathsf{Set} \to \mathsf{Set}$ assigns to each set $X$ a set $FX$ and
to each map $f\colon X \to Y$ a map $Ff\colon FX \to FY$, preserving identities and
composition ($F\mathrm{id}_X = \mathrm{id}_{FX}$, $F(g \cdot f) = Fg \cdot Ff$). An $F$-coalgebra $(S, c)$ consists of
a set $S$ of states and a transition structure $c\colon S \to FS$. A morphism
$h\colon (S, c) \to (S', c')$ of $F$-coalgebras is a map $h\colon S \to S'$ that preserves the transition structure,
i.e. $Fh \cdot c = c' \cdot h$. Two states $s, t \in S$ of a coalgebra $c\colon S \to FS$ are behaviourally
equivalent ($s \sim t$) if there exists a coalgebra morphism $h$ with $h(s) = h(t)$.


Example 2.2. We mention several types of systems which are instances of the 
general notion of coalgebra and the ensuing notion of behavioural equivalence. 
All these are possible input systems for our tool CoPaR. 

(1) Transition systems. The finite powerset functor $\mathcal{P}_\omega$ maps a set $X$ to the
set $\mathcal{P}_\omega X$ of all finite subsets of $X$, and a map $f\colon X \to Y$ to the map
$\mathcal{P}_\omega f = f[-]\colon \mathcal{P}_\omega X \to \mathcal{P}_\omega Y$ taking direct images. Coalgebras for $\mathcal{P}_\omega$ are finitely branching
(unlabelled) transition systems. Two states are behaviourally equivalent iff they
are (strongly) bisimilar in the sense of Milner [29, 30] and Park [32]. Similarly,
finitely branching labelled transition systems with label alphabet $A$ are coalgebras
for the functor $FX = \mathcal{P}_\omega(A \times X)$.

(2) Deterministic automata. For an input alphabet $A$, the functor given by
$FX = 2 \times X^A$, where $2 = \{0, 1\}$, sends a set $X$ to the set of pairs of boolean
values and functions $A \to X$. An $F$-coalgebra $(S, c)$ is a deterministic automaton
(without an initial state). For each state $s \in S$, the first component of $c(s)$
determines whether $s$ is a final state, and the second component is the successor
function $A \to S$ mapping each input letter $a \in A$ to the successor state of $s$
under input letter $a$. States $s, t \in S$ are behaviourally equivalent iff they accept
the same language in the usual sense.

(3) Weighted tree automata simultaneously generalize tree automata and weighted
(word) automata. Inputs of such automata stem from a finite signature $\Sigma$,
i.e. a finite set of input symbols, each with a prescribed natural number, its
arity. Weights are taken from a commutative monoid $(M, +, 0)$. A (bottom-up)
weighted tree automaton (WTA) (over $M$ with inputs from $\Sigma$) consists of a finite
set $S$ of states, an output map $f\colon S \to M$, and for each $k \geq 0$, a transition map
$\mu_k\colon \Sigma_k \to M^{S^k \times S}$, where $\Sigma_k$ denotes the set of $k$-ary input symbols in $\Sigma$; the
maximum arity of symbols in $\Sigma$ is called the rank.

Every signature $\Sigma$ gives rise to its associated polynomial functor, also denoted
$\Sigma$, which assigns to a set $X$ the set $\coprod_{n \in \mathbb{N}} \Sigma_n \times X^n$, where $\coprod$ denotes disjoint
union (coproduct). Further, for a given monoid $(M, +, 0)$ the monoid-valued functor
$M^{(-)}$ sends a set $X$ to the set of maps $f\colon X \to M$ that are finitely supported,
i.e. $f(x) = 0$ for almost all $x \in X$. Given a map $f\colon X \to Y$, $M^{(f)}\colon M^{(X)} \to M^{(Y)}$
sends a map $v\colon X \to M$ in $M^{(X)}$ to the map $y \mapsto \sum_{x \in X, f(x) = y} v(x)$, corresponding
to the standard image measure construction.

Weighted tree automata are coalgebras for the composite functor
$FX = M \times M^{(\Sigma X)}$; indeed, given a coalgebra $c = \langle c_1, c_2\rangle\colon S \to M \times M^{(\Sigma S)}$, its first
component $c_1$ is the output map, and the second component $c_2$ is equivalent to
the family of transition maps $\mu_k$ described above.

As proven by Wißmann et al. [41, Prop. 6.6], the coalgebraic behavioural
equivalence is precisely backward bisimulation of weighted tree automata as
introduced by Högberg et al. [20, Def. 16].

(4) The bag functor $\mathcal{B}\colon \mathsf{Set} \to \mathsf{Set}$ sends a set $X$ to the set of all finite multisets
(or bags) over $X$. This is the special case of the monoid-valued functor for the
monoid $(\mathbb{N}, +, 0)$. Accordingly, $\mathcal{B}$-coalgebras are weighted transition systems
with positive integers as weights, or they may be regarded as finitely branching
transition systems where multiple transitions between a pair of states are allowed.
Behavioural equivalence coincides with weighted (or strong) bisimilarity.

(5) Markov chains. The finite distribution functor $\mathcal{D}_\omega$ is a subfunctor of the
monoid-valued functor $\mathbb{R}^{(-)}$ for the usual monoid of addition on the real numbers.
It maps a set $X$ to the set of all finite probability distributions on $X$. That
means that $\mathcal{D}_\omega X$ is the set of all finitely supported maps $d\colon X \to [0, 1]$ such that
$\sum_{x \in X} d(x) = 1$. The action of $\mathcal{D}_\omega$ on maps is the same as that of $\mathbb{R}^{(-)}$.

As shown by Rutten and de Vink [35], coalgebras $c\colon S \to (\mathcal{D}_\omega S + 1)^A$ are
precisely Larsen and Skou's probabilistic transition systems [28] (aka. labelled
Markov chains [14]) with the label alphabet $A$. In fact, for each state $s \in S$
and action label $a \in A$, that state either cannot perform an $a$-action (when
$c(s)(a) \in 1$) or the distribution $c(s)(a)$ determines for every state $t \in S$ the
probability with which $s$ transitions to $t$ with an $a$-action.

Coalgebraic behavioural equivalence is precisely probabilistic bisimilarity in
the sense of Larsen and Skou, see Rutten and de Vink [35, Cor. 4.7].

(6) Markov decision processes are systems which feature both non-deterministic
and probabilistic branching. They are coalgebras for composite functors such as
$\mathcal{P}_\omega(A \times \mathcal{D}_\omega(-))$ or $\mathcal{P}_\omega(\mathcal{D}_\omega(A \times (-)))$ (simple/general Segala systems); Bartels et
al. [2] list further functors for various species of probabilistic systems.
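
The action of functors on maps from items (1) and (3) can be illustrated with a short Python sketch. It assumes that finite subsets are represented as Python sets and finitely supported maps as dictionaries that omit zero weights; the function names and the `plus`/`zero` parameters are hypothetical and only serve this example.

```python
def powerset_map(f, subset):
    """Action of the finite powerset functor on f: the direct image f[subset]."""
    return frozenset(f(x) for x in subset)


def monoid_valued_map(f, v, plus=lambda a, b: a + b, zero=0):
    """Action of the monoid-valued functor on f: sum the weights of all x with f(x) = y."""
    image = {}
    for x, weight in v.items():
        y = f(x)
        image[y] = plus(image.get(y, zero), weight)
    # Keep the representation finitely supported by dropping zero weights.
    return {y: w for y, w in image.items() if w != zero}


def parity(x):
    return x % 2


subset_image = powerset_map(parity, {1, 2, 3})
weights_image = monoid_valued_map(parity, {1: 2, 2: 5, 3: 4})
cancelled = monoid_valued_map(parity, {1: 2, 3: -2})
max_image = monoid_valued_map(parity, {1: 2, 3: 7}, plus=max)
```

The third call shows why cancellation matters: in a group such as the integers with addition, weights may sum to zero, so the image can have smaller support than the input.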

Encodings. To supply coalgebras as inputs to CoPaR and in order to speak
about the size of a coalgebra in terms of states and transitions, we need the
following notion.


Definition 2.3 [12, Def. 3.1]. An encoding of a set functor $F$ consists of a
set $A$ of labels and a family of maps $\flat_X\colon FX \to \mathcal{B}(A \times X)$, one for every set $X$,
such that the map $\langle F!, \flat_X\rangle\colon FX \to F1 \times \mathcal{B}(A \times X)$ is injective.

The encoding of a coalgebra $c\colon S \to FS$ is $\langle F!, \flat_S\rangle \cdot c\colon S \to F1 \times \mathcal{B}(A \times S)$.
For $s \in S$ we write $s \xrightarrow{a} t$ whenever $(a, t)$ is contained in the bag $\flat_S(c(s))$. The
number of states and edges of a given encoded input coalgebra are $n = |S|$ and
$m = \sum_{s \in S} |\flat_S(c(s))|$, respectively, where $|b| = \sum_{x \in X} b(x)$ for a bag $b\colon X \to \mathbb{N}$.


An encoding of a set functor F specifies how F-coalgebras are represented as 
directed graphs, and the required injectivity ensures that different coalgebras 
have different encodings. 
Example 2.4. We recall a few key examples of encodings used by CoPaR [42];
for the required injectivity, see [12, Prop. 3.3].

(1) For the finite powerset functor $\mathcal{P}_\omega$ one takes a singleton label set $A = 1$ and
$\flat_X\colon \mathcal{P}_\omega X \to \mathcal{B}(1 \times X)$ is the obvious inclusion: $\flat_X(U)(*, x) = 1$ iff $x \in U \subseteq X$.
(2) For the monoid-valued functor $M^{(-)}$ we take labels $A = M$, and the map
$\flat_X\colon M^{(X)} \to \mathcal{B}(M \times X)$ is given by $\flat_X(t)(m, x) = 1$ if $t(x) = m \neq 0$ and $0$ else.
(3) As a special case, the bag functor $\mathcal{B}$ has labels $A = \mathbb{N}$, and the map
$\flat_X\colon \mathcal{B}X \to \mathcal{B}(\mathbb{N} \times X)$ is given by $\flat_X(t)(n, x) = 1$ if $t(x) = n$ and $0$ else.
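
As an illustration of encodings (1) and (2) and of the size measures $n$ and $m$, the following Python sketch turns small coalgebras into bags of labelled edges. It assumes that bags are modelled by `collections.Counter`, that a powerset coalgebra maps each state to a set of successors and a monoid-valued coalgebra maps each state to a dictionary of weights; it only computes the bag component and omits the $F!$ component of the encoding. All function names are hypothetical and unrelated to the CoPaR code base.

```python
from collections import Counter


def encode_powerset(coalgebra):
    """Encoding (1): every successor t of s becomes one edge labelled '*'."""
    return {s: Counter({("*", t): 1 for t in succs}) for s, succs in coalgebra.items()}


def encode_monoid_valued(coalgebra, zero=0):
    """Encoding (2): every non-zero weight m on t becomes one edge labelled m."""
    return {
        s: Counter({(m, t): 1 for t, m in weights.items() if m != zero})
        for s, weights in coalgebra.items()
    }


def size(encoded):
    """Number of states n and number of edges m of an encoded coalgebra."""
    n = len(encoded)
    m = sum(sum(bag.values()) for bag in encoded.values())
    return n, m


transition_system = {"s": {"t", "u"}, "t": {"t"}, "u": set()}
weighted_system = {"s": {"t": 2, "u": 0}, "t": {}}

ts_size = size(encode_powerset(transition_system))
weighted_size = size(encode_monoid_valued(weighted_system))
```

In the weighted example the zero-weight entry for `u` contributes no edge, in line with the side condition $m \neq 0$ in item (2).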


Remark 2.5. (1) Readers familiar with category theory may wonder about the
naturality of encodings $\flat_X$. It turns out [12] that in almost all instances, our
encodings are not natural transformations, except for polynomial functors. As
shown in op. cit., all our encodings satisfy a property called uniformity, which
implies that they are subnatural transformations [12, Prop. 3.15].

(2) Having an encoding of a set functor $F$ does not imply a reduction of the
problem of minimizing $F$-coalgebras to that of coalgebras for $\mathcal{B}(A \times -)$. In fact,
the behavioural equivalence of $F$-coalgebras and coalgebras for $\mathcal{B}(A \times -)$ may
be very different unless $\flat_X$ is natural, which is not the case for most encodings.


Functors in CoPaR can be combined by product, coproduct or composition, 
leading to modularity. But in order to automatically handle combined functors, 
our tool crucially depends on the ability to form products and coproducts of 
encodings [41, 42]. We refrain from going into technical details, but note for 
further use that given a pair of functors $F_1, F_2$ with encodings $A_i, \flat_{X,i}$ one
obtains encodings for the functors $F_1 \times F_2$ (cartesian product) and $F_1 + F_2$
(disjoint union) with the label set $A = A_1 + A_2$.


Input syntax and processing. We briefly recall the input format of CoPaR 
and how inputs are processed; for more details see [41, Sec. 3.1]. CoPaR accepts 
input files representing a finite F-coalgebra. The first line of an input file specifies 
the functor F which is written as a term according to the following grammar: 


$$T ::= X \mid \mathcal{P}_\omega T \mid \mathcal{B} T \mid \mathcal{D}_\omega T \mid M^{(T)} \mid \Sigma$$

$$\Sigma ::= C \mid T + T \mid T \times T \mid T^A \qquad C ::= \mathbb{N} \mid A \qquad A ::= \{s_1, \ldots, s_n\} \mid n \qquad (1)$$

where $n \in \mathbb{N}$ denotes the set $\{0, \ldots, n-1\}$, the $s_i$ are strings subject to the usual
conventions for variable names (a letter or an underscore character followed by
alphanumeric characters or underscores), exponents $F^A$ are written F^A, and $M$
is one of the monoids $(\mathbb{Z}, +, 0)$, $(\mathbb{R}, +, 0)$, $(\mathbb{C}, +, 0)$, $(\mathcal{P}_\omega(64), \cup, \emptyset)$ (the monoid
of 64-bit words with bitwise or), and $(\mathbb{N}, \max, 0)$ (the additive monoid of the
tropical semiring). Note that $C$ effectively ranges over at most countable sets,
and $A$ over finite sets. A term $T$ determines a functor $F\colon \mathsf{Set} \to \mathsf{Set}$ in the evident
way, with $X$ interpreted as the argument.

The remaining lines of an input file specify a finite coalgebra $c\colon S \to FS$. Each
line has the form s: t for a state $s \in S$, where t represents the element $c(s) \in FS$.
The syntax for t depends on the specified functor $F$ and follows the structure of
the term $T$ defining $F$; the details are explained in [41, Sec. 3.1.2]. Fig. 1 from
op. cit. shows two coalgebras and the corresponding input files.

    (a) Markov chain             (b) Deterministic finite automaton

    DX                           {f,n} x X^{a,b}
    q: {p: 0.5, r: 0.5}          q: (n, {a: p, b: r})
    p: {q: 0.4, r: 0.6}          p: (n, {a: q, b: r})
    r: {r: 1}                    r: (f, {a: q, b: p})

Fig. 1: Examples of input files with encoded coalgebras [41] (state diagrams not reproduced)

After reading the functor term T, CoPaR builds a parser for the functor- 
specific input format and then parses the input coalgebra given in that format 
into an intermediate format which internally represents the encoding of the 
input coalgebra (Definition 2.3). For composite functors the parsed coalgebra 
then undergoes a substantial amount of preprocessing, which also affects how 
transitions are counted; see [41, Sec. 3.5] for more details. 


3  Coalgebraic Partition Refinement 


As mentioned in the introduction, the sequential partition refinement algorithm 
previously implemented in CoPaR is based on ideas used in the Paige-Tarjan 
algorithm [31] for transition systems. However, as has been mentioned by Blom 
and Orzan [8], the Paige-Tarjan algorithm carefully selects the block of states to 
split in each iteration, and the data structures used for this selection take a lot of 
memory and require modification to allow a distributed implementation. Hence, 
Blom and Orzan have built their distributed algorithm from a rather simple 
sequential partition refinement algorithm based on what Kanellakis and Smolka 
refer to as the naive method [23]. We now recall this algorithm and subsequently 
show how it can be adapted to the coalgebraic level of generality. 


Signature Refinement. Given a finite labelled transition system with the state
set $S$, a partition on $S$ may be presented by a function $\pi\colon S \to \mathbb{N}$, i.e. two states
$s, t \in S$ lie in the same block of the partition iff $\pi(s) = \pi(t)$. The signature of a
state $s \in S$ is the set of outgoing transitions to blocks of $\pi$:

$$\mathrm{sig}_\pi(s) = \{(a, \pi(t)) \mid s \xrightarrow{a} t\} \in \mathcal{P}_\omega(A \times \mathbb{N}). \qquad (2)$$

A signature refinement step then refines $\pi$ by putting $s, t \in S$ into different blocks
iff $\mathrm{sig}_\pi(s) \neq \mathrm{sig}_\pi(t)$. Concretely, we put $\pi_{\mathrm{new}}(s) = \mathrm{hash}(\mathrm{sig}_\pi(s))$ using a perfect,
deterministic hash function hash. The signature refinement algorithm (Fig. 2)
starts with a trivial initial partition on $S$ and repeats the refinement step until
the partition stabilizes, i.e. until two subsequent partitions have the same size.


Coalgebraic Signature Refinement. Regarding a labelled transition system
as a coalgebra $c\colon S \to \mathcal{P}_\omega(A \times S)$ (Example 2.2(1)), signatures are obtained by
postcomposing the transition structure with the partition under the functor:

$$\mathrm{sig}_\pi = \big(S \xrightarrow{\;c\;} \mathcal{P}_\omega(A \times S) \xrightarrow{\;\mathcal{P}_\omega(A \times \pi)\;} \mathcal{P}_\omega(A \times \mathbb{N})\big). \qquad (3)$$

    Variables: old and new partitions represented by π, π_new: S → ℕ with sizes
               l, l_new, resp.; set H for counting block numbers;

    foreach s ∈ S do
      π_new(s) ← 0;
    end
    l_new ← 1;
    while l ≠ l_new do
      π ← π_new, H ← ∅;
      foreach s ∈ S do
        π_new(s) ← hash(sig_π(s));
        H ← H ∪ {π_new(s)};
      end
      l ← l_new;
      l_new ← |H|;
    end

Fig. 2: Signature refinement for labelled transition systems
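
The loop of Fig. 2 can be sketched in Haskell as follows. This is a minimal illustration and not CoPaR's code: the edge-list representation, the types `State`, `Label` and `Edge`, and all function names are assumptions made here. Instead of a perfect hash function, the sketch numbers the distinct signatures of a round, which is injective by construction.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- Hypothetical representation (not CoPaR's): an LTS as a list of
-- labelled edges over integer states.
type State     = Int
type Label     = Char
type Edge      = (State, Label, State)
type Partition = Map.Map State Int

-- Signature of a state: the set of pairs (label, block of the target).
signature :: [Edge] -> Partition -> State -> Set.Set (Label, Int)
signature edges part s =
  Set.fromList [ (a, part Map.! t) | (s', a, t) <- edges, s' == s ]

-- One refinement step. Numbering the distinct signatures stands in
-- for the perfect hash function of Fig. 2.
refineStep :: [State] -> [Edge] -> Partition -> Partition
refineStep states edges part =
  Map.fromList [ (s, ids Map.! signature edges part s) | s <- states ]
  where
    sigs = Set.fromList [ signature edges part s | s <- states ]
    ids  = Map.fromList (zip (Set.toList sigs) [0 ..])

-- Number of blocks of a partition.
size :: Partition -> Int
size = Set.size . Set.fromList . Map.elems

-- Start from the trivial partition and iterate until two subsequent
-- partitions have the same size.
refine :: [State] -> [Edge] -> Partition
refine states edges = go (Map.fromList [ (s, 0) | s <- states ])
  where
    go part
      | size next == size part = next
      | otherwise              = go next
      where next = refineStep states edges part
```

For instance, `refine [0, 1, 2] [(0, 'a', 2), (1, 'a', 2)]` puts states 0 and 1 into one block and state 2 into another.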


The generalisation to coalgebras for arbitrary $F$ is immediate: the signature
of a state of an $F$-coalgebra $c\colon S \to FS$ w.r.t. a partition $\pi$ is given by the
function $\mathrm{sig}_\pi = F\pi \cdot c$. In the refinement step of the above algorithm two states
are identified by the next partition if they currently have the same signature:

$$\pi_{\mathrm{new}}(s) = \pi_{\mathrm{new}}(t) \iff \mathrm{sig}_\pi(s) = \mathrm{sig}_\pi(t) \iff (F\pi)(c(s)) = (F\pi)(c(t)). \qquad (4)$$

Hence, the algorithm in fact simply applies $F(-) \cdot c$ to the initial partition
corresponding to the trivial quotient $!\colon S \to 1$ until stability is reached. Note that
this is precisely the Final Chain Algorithm by König and Küpper [25, Alg. 3.2]
computing behavioural equivalence of a given $F$-coalgebra. Its correctness thus
proves correctness of the coalgebraic signature refinement, which is the algorithm
in Fig. 2 with $\mathrm{sig}_\pi = F\pi \cdot c$. Since we represent functors and their coalgebras by
encodings, we use an interface to $F$ to compute signatures based on encodings.
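
For functors that are available as a Haskell `Functor`, the composite $F\pi \cdot c$ is literally `fmap` after the coalgebra structure. The following sketch illustrates this for deterministic automata over a two-letter alphabet; the type `DFA`, the function `sigOf` and the example automaton are introduced here for illustration only and are not part of CoPaR, which works on encodings instead.

```haskell
-- A DFA over the alphabet {a, b} as a functor: a finality flag and
-- the two successors. Illustrative type, not taken from CoPaR.
data DFA x = DFA Bool x x deriving (Eq, Ord, Show)

instance Functor DFA where
  fmap g (DFA fin a b) = DFA fin (g a) (g b)

-- sig_pi = F pi . c: apply the partition to the successor structure.
sigOf :: Functor f => (s -> f s) -> (s -> Int) -> s -> f Int
sigOf c part = fmap part . c

-- A three-state example automaton; only state 2 is final.
example :: Int -> DFA Int
example 0 = DFA False 1 2
example 1 = DFA False 0 2
example _ = DFA True 0 1
```

With the trivial partition, `sigOf example (const 0) 0` evaluates to `DFA False 0 0` and `sigOf example (const 0) 2` to `DFA True 0 0`, so the first round already separates final from non-final states.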


Definition 3.1. Given a functor $F$ with encoding $A, \flat_X$, a signature interface
consists of a function $\mathrm{sig}\colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F\mathbb{N}$ such that for every finite set $S$
and every partition $\pi\colon S \to \mathbb{N}$ we have

$$F\pi = \big(FS \xrightarrow{\langle F!,\, \flat_S \rangle} F1 \times \mathcal{B}(A \times S) \xrightarrow{F1 \times \mathcal{B}(A \times \pi)} F1 \times \mathcal{B}(A \times \mathbb{N}) \xrightarrow{\;\mathrm{sig}\;} F\mathbb{N}\big). \qquad (5)$$

Given a coalgebra $c\colon S \to FS$, a state $s \in S$ and a partition $\pi\colon S \to \mathbb{N}$, the two
arguments of sig should be understood as follows. The first argument is the value
$F!(c(s)) \in F1$, which intuitively provides an observable output of the state $s$.
The second argument is the bag $\mathcal{B}(A \times \pi)(\flat_S(c(s)))$ formed by those pairs $(a, n)$
of labels $a$ and numbers $n$ of blocks of the partition $\pi$ to which $s$ has an edge;
that is, that bag contains one pair $(a, n)$ for each edge $s \xrightarrow{a} s'$ where $\pi(s') = n$.
Thus, when supplied with these inputs, sig correctly computes the signature of $s$;
indeed, to see this, precompose equation (5) with the coalgebra structure $c$.


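Example 3.2. (1) The constant functor $!C$ has the label set $A = \emptyset$, so we have
$\mathcal{B}(\emptyset \times \mathbb{N}) \cong 1$, and we define the function $\mathrm{sig}\colon C \times \mathcal{B}(\emptyset \times \mathbb{N}) \to C$ by $\mathrm{sig}(c, *) = c$.

(2) The powerset functor $\mathcal{P}_\omega$ has the label set $A = 1$, and we define the function
$\mathrm{sig}\colon \mathcal{P}_\omega 1 \times \mathcal{B}(1 \times \mathbb{N}) \to \mathcal{P}_\omega \mathbb{N}$ by $\mathrm{sig}(z, b) = \{n : b(*, n) \neq 0\}$.

(3) The monoid-valued functor $\mathbb{R}^{(-)}$ has the label set $A = \mathbb{R}$, and we define the
function $\mathrm{sig}\colon \mathbb{R} \times \mathcal{B}(\mathbb{R} \times \mathbb{N}) \to \mathbb{R}^{(\mathbb{N})}$ by $\mathrm{sig}(z, b)(n) = \sum\{r \mid b(r, n) \neq 0\}$.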
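
The three signature functions above can be sketched as plain Haskell functions. The names `sigConst`, `sigPow` and `sigReal` and the list representation of bags are assumptions of this sketch, not CoPaR's definitions. The sum in (3) is read with multiplicities here, so that two edges with the same weight into the same block both contribute, as required for the result to agree with $F\pi$.

```haskell
import Data.Void (Void)
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- A bag of (label, block) pairs, represented as a list in which
-- repeated entries encode multiplicities.
type Bag a = [(a, Int)]

-- (1) Constant functor: the label set is empty, so the bag carries no
-- information and the signature is the observable output itself.
sigConst :: c -> Bag Void -> c
sigConst c _ = c

-- (2) Finite powerset: the set of blocks that occur in the bag.
sigPow :: z -> Bag () -> Set.Set Int
sigPow _ b = Set.fromList (map snd b)

-- (3) Real-valued functor: per block, the sum of the weights of all
-- edges into that block; blocks with total weight 0 are dropped so
-- that the result is a finitely supported function.
sigReal :: Double -> Bag Double -> Map.Map Int Double
sigReal _ b =
  Map.filter (/= 0) (Map.fromListWith (+) [ (n, r) | (r, n) <- b ])
```

For example, `sigReal 0 [(0.5, 1), (0.25, 1), (1.0, 2)]` maps block 1 to 0.75 and block 2 to 1.0.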


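Next we show how signature interfaces can be combined by products ($\times$) and
coproducts ($+$). This is the key to the modularity of the implementation (be it
distributed or sequential) of the coalgebraic signature refinement in CoPaR.


Construction 3.3. Given a pair of functors $F_1, F_2$ with encodings $A_i, \flat_{X,i}$ and
signature interfaces $\mathrm{sig}_i$, we put $A = A_1 + A_2$ and define the following functions:
(1) for the product functor $F = F_1 \times F_2$ we take $\mathrm{sig}\colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F_1\mathbb{N} \times F_2\mathbb{N}$,

$$\mathrm{sig}(t, b) = \big(\mathrm{sig}_1(\mathrm{pr}_1(t), \mathrm{filter}_1(b)),\ \mathrm{sig}_2(\mathrm{pr}_2(t), \mathrm{filter}_2(b))\big).$$

Here, $\mathrm{pr}_i\colon F1 \to F_i 1$ is the projection map and $\mathrm{filter}_i\colon \mathcal{B}(A \times \mathbb{N}) \to \mathcal{B}(A_i \times \mathbb{N})$ is
given by $\mathrm{filter}_i(b)(a, n) = b(\mathrm{in}_i\, a, n)$, where $\mathrm{in}_i\colon A_i \to A_1 + A_2$ is the injection map.
(2) for the coproduct functor $F = F_1 + F_2$ we take

$$\mathrm{sig}\colon F1 \times \mathcal{B}(A \times \mathbb{N}) \to F_1\mathbb{N} + F_2\mathbb{N}, \qquad \mathrm{sig}(\mathrm{in}_i\, t, b) = \mathrm{in}_i(\mathrm{sig}_i(t, \mathrm{filter}_i(b))).$$


Proposition 3.4. The functions sig defined in Construction 3.3 yield signature
interfaces for the functors $F_1 \times F_2$ and $F_1 + F_2$, respectively.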
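
Construction 3.3 translates almost verbatim into combinators on functions. In the sketch below a signature interface is modelled as a plain function type `Sig`, the label set $A_1 + A_2$ as `Either a1 a2`, and bags as lists; these choices and all names are assumptions made for illustration and differ from the type class used in CoPaR.

```haskell
-- A signature interface as a function: an element of F1, a bag of
-- (label, block) pairs, and a result in F applied to block numbers.
type Sig f1 a fn = f1 -> [(a, Int)] -> fn

-- filter_1 and filter_2: keep the entries whose label comes from the
-- first, respectively second, summand of the label set.
filterL :: [(Either a1 a2, Int)] -> [(a1, Int)]
filterL b = [ (a, n) | (Left a, n) <- b ]

filterR :: [(Either a1 a2, Int)] -> [(a2, Int)]
filterR b = [ (a, n) | (Right a, n) <- b ]

-- Product: compute both component signatures on the filtered bags.
sigProd :: Sig f1 a1 fn1 -> Sig f2 a2 fn2
        -> Sig (f1, f2) (Either a1 a2) (fn1, fn2)
sigProd s1 s2 (t1, t2) b = (s1 t1 (filterL b), s2 t2 (filterR b))

-- Coproduct: dispatch on the summand in which the first argument lies.
sigCoprod :: Sig f1 a1 fn1 -> Sig f2 a2 fn2
          -> Sig (Either f1 f2) (Either a1 a2) (Either fn1 fn2)
sigCoprod s1 _  (Left t)  b = Left  (s1 t (filterL b))
sigCoprod _  s2 (Right t) b = Right (s2 t (filterR b))
```

Since the combinators take and return values of the same shape, they can be nested to cover arbitrary finite products and coproducts of basic functors.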


As a consequence of this result, it suffices to implement signature interfaces
only for basic functors according to the grammar in (1), i.e. the trivial identity
and constant functors as well as the functors $\mathcal{P}_\omega$, $\mathcal{B}$, $\mathcal{D}_\omega$ and the supported
monoid-valued functors $M^{(-)}$. Signature interfaces of products, coproducts and
exponents, being a special form of product, are derived using Construction 3.3.

Functor composition can be reduced to these constructions by a technique 
called desorting [42, Sec. 8.2], which transforms a coalgebra of a composite functor 
into a coalgebra for a coproduct of basic functors whose signature interfaces can 
then be combined by + (see also [41, Sec. 3.5]). As for the previous Paige-Tarjan 
style algorithm, this leads to the modularity in the functor of the coalgebraic 
signature refinement algorithm: signature interfaces for composed functors are 
automatically derived in CoPaR. Moreover, a new basic functor F may be added 
by implementing a signature interface for F, effectively extending the grammar 
of supported functors in (1) by a clause FT. 


4 The Distributed Algorithm 


Our distributed algorithm for coalgebraic signature refinement is a generalization 
of Blom and Orzan's original algorithm [8] to coalgebras. We highlight differences 
to op. cit. at the end of this section. 

We assume a distributed high-bandwidth cluster of $W$ workers $w_1, \ldots, w_W$
that is failure-free, i.e. nodes do not crash, messages do not get lost and between
two nodes the order of messages is preserved. The communication is based on
non-blocking send operations and blocking receive operations. Messages are triples
of the form (from, to, data), where the data field may be structured and will often
contain a tag to simplify interpretation.

Description. The distributed algorithm is based on the sequential algorithm
presented in Fig. 2, using a distributed hashtable to keep track of the partition.
As for the sequential algorithm, the input consists of an $F$-coalgebra $(S, c)$
with $|S| = n$ states. We split the state space evenly among the workers as a
preprocessing step. We write $S_i$ with $|S_i| = n/W$ for the set of states of worker $w_i$.
The input for worker $w_i$ is the encoding of that part of the transition structure of
the input coalgebra which is needed to compute the signatures of the states in $S_i$.
This information is presented to $w_i$ as the list of all outgoing edges of states of $S_i$
in the encoding of the coalgebra $(S, c)$, i.e. the list of all $s \xrightarrow{a} t$ with $s \in S_i$
(cf. Definition 2.3). We refer to the block number $\pi(s)$ of a state $s \in S$ as its ID.

After processing the input, the algorithm runs in two phases. In the Initialization
Phase (Fig. 3) the workers exchange update demands about the IDs stored
in the distributed hashtable. If $w_i$ has an edge $s \xrightarrow{a} s'$ into some state $s'$ of $w_j$,
then during refinement $w_i$ needs to be kept up to date about the ID of $s'$ and thus
instructs $w_j$ to do so. Worker $w_j$ remembers this information by storing $w_i$ in
the set $\mathrm{In}_{s'} = \{w_i \mid \exists s \in S_i, a \in A.\ s \xrightarrow{a} s'\}$ of incoming edges of $s'$ (lines 14-16).
Hence, for each edge $s \xrightarrow{a} s'$ with $s \in S_i$ and $s' \in S_j$, worker $w_i$ sends a message
to $w_j$, informing $w_j$ to add $w_i$ to $\mathrm{In}_{s'}$ (lines 5-8).


    Variables: Set V of visited states; process count d;
               for each s ∈ S_i a list In_s of workers with an edge into s

     1  V ← ∅, d ← 0;
     2  foreach s ∈ S_i do
     3    In_s ← [];
     4  end
     5  foreach edge s -a-> s' of w_i with s' ∉ V do
     6    V ← V ∪ {s'};
     7    send(w_i, w_j, s');
     8  end
     9  foreach 1 ≤ j ≤ W do
    10    send(w_i, w_j, DONE);
    11  end
    12  waitFor(d = W);
    13  return [In_s | s ∈ S_i];

    14  on receive (w_k, w_i, s) do
    15    In_s ← (w_k :: In_s);
    16  end
    17  on receive (_, _, DONE) do
    18    d ← d + 1;
    19  end


Fig. 3: Initialization Phase of worker $w_i$
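
The net effect of the message exchange in Fig. 3 can be described by a pure function that computes, for every state, the set of workers owning a predecessor. The sketch below is a sequential model of that result and does no message passing; the round-robin `ownerOf` function is a hypothetical way of splitting the state space, and edge labels are dropped since they play no role here.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

type Worker = Int
type State  = Int

-- Hypothetical ownership function: with w workers, states are dealt
-- out round-robin. The paper only requires an even split.
ownerOf :: Int -> State -> Worker
ownerOf w s = s `mod` w

-- For every state s' with an incoming edge, the set of workers that
-- own some state s with an edge s -> s'. States without incoming
-- edges are absent from the map (Fig. 3 keeps an empty list for them).
inSets :: Int -> [(State, State)] -> Map.Map State (Set.Set Worker)
inSets w edges =
  Map.fromListWith Set.union
    [ (t, Set.singleton (ownerOf w s)) | (s, t) <- edges ]
```

With two workers, `inSets 2 [(0, 3), (2, 3), (1, 3), (1, 0)]` records workers 0 and 1 for state 3 and worker 1 for state 0; worker 0 appears only once for state 3 although it has two edges into it.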


The main phase is the Refinement Phase (Fig. 4), mimicking the refinement
loop of the undistributed algorithm. In each iteration all workers compute their
part of the new partition, i.e. the IDs $h_s = \mathrm{hash}(\mathrm{sig}_\pi(s))$ for each of their states
$s \in S_i$ (line 5). In addition, every worker $w_i$ is responsible for sending the
computed ID of $s \in S_i$ to the workers in $\mathrm{In}_s$ that need it for the computation of their
own signatures in the next iteration (lines 6-9). The IDs are also sent to a
designated worker counterOf($h_s$) (lines 10-12). This ensures that IDs are counted
precisely once at the end of the round when the partition size is computed after
all messages have been received (lines 14-17). The actual counting (line 19) is a
primitive operation in the MPI library; for an explicit $O(\log W)$ algorithm using
messages see e.g. Blom and Orzan [8, Fig. 6]. Finally, the workers synchronize
before starting the next iteration (line 20). The refinement phase stops if two
consecutive partitions have the same size (line 2).


    Variables: Old, respectively new partitions π, π_new with sizes l, l_new;
               finished workers d; ID-counting set H;

     1  π_new ← 0!, l ← −1, l_new ← 0, H ← ∅;
     2  while l ≠ l_new do
     3    l ← l_new, π ← π_new;
     4    foreach s ∈ S_i do
     5      π_new(s) ← hash(sig_π(s));
     6      foreach w_j ∈ In_s do
     7        send(w_i, w_j,
     8             (UPD, s, π_new(s)));
     9      end
    10      send(w_i,
    11           counterOf(π_new(s)),
    12           (COUNT, π_new(s)));
    13    end
    14    foreach 1 ≤ j ≤ W do
    15      send(w_i, w_j, DONE);
    16    end
    17    waitFor(d = W);
    18    l ← l_new;
    19    l_new ← distribSum(sizeOf(H));
    20    synchronize;
    21  end

    22  on receive (w_k, w_i, (UPD, s, h_s)) do
    23    π_new(s) ← h_s;
    24  end
    25  on receive (w_k, w_i, (COUNT, h_s)) do
    26    H ← H ∪ {h_s};
    27  end
    28  on receive (_, w_i, DONE) do
    29    d ← d + 1;
    30  end


Fig. 4: Refinement Phase of worker $w_i$
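
The counting step can be modelled sequentially as follows. The paper does not say how counterOf is defined; taking the ID modulo the number of workers is an assumption of this sketch, as are the function names. Any function that maps each ID to exactly one worker gives the same total.

```haskell
import qualified Data.Set as Set

type Worker = Int
type ID     = Int

-- Hypothetical choice of counterOf: the ID modulo the worker count.
-- Haskell's mod is non-negative for a positive divisor, so negative
-- hash values are handled as well.
counterOf :: Int -> ID -> Worker
counterOf w h = h `mod` w

-- Counting set H of worker k after all COUNT messages of a round,
-- given the IDs computed by all workers in that round.
countingSet :: Int -> Worker -> [ID] -> Set.Set ID
countingSet w k ids = Set.fromList [ h | h <- ids, counterOf w h == k ]

-- Distributed sum of the sizes of all counting sets.
partitionSize :: Int -> [ID] -> Int
partitionSize w ids =
  sum [ Set.size (countingSet w k ids) | k <- [0 .. w - 1] ]
```

Because every ID lands in the counting set of exactly one worker, `partitionSize 3 [7, 7, 2, 9, 2, -4]` yields 4, the number of distinct IDs.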


Correctness. The Initialization Phase (Fig. 3) terminates since every worker 
reaches line 10, sends DONE to all workers and thus also receives it (lines 17-19) 
a total of W times, allowing it to progress past line 12. An analogous argument 
proves termination of every iteration of the Refinement Phase (Fig. 4). The 
sequential algorithm is correct, hence we know the loop of the refinement phase 
terminates when all IDs are computed and counted correctly, since then the 
distributed and the sequential algorithm compute precisely the same partitions. 

To show that the signatures are computed correctly, we note that if all DONE 
messages have been received in a round, then, by order-preservation of messages, 
all messages sent previously in this round have also been received. This ensures 
that no workers are missing from the lists $\mathrm{In}_s$ computed in the Initialization Phase
and that during the Refinement Phase new IDs are sent to all concerned workers 
(Fig. 4, lines 6-8). This establishes correctness of the signature computation, and 
the signatures coincide on all workers since we assume that the hash function is 
deterministic. Finally, the use of the counterOf function (line 11) ensures that 
each ID is included in the counting set of exactly one worker. Thus, the distributed 
sum of the sizes of all counting sets is equal to the size of the partition.

Complexity. Let us assume that not only states, but also outgoing transitions
are distributed evenly among the workers, i.e. every worker has about $m/W$
outgoing transitions. In the Initialization Phase, the loop sending messages runs
in $O(\frac{m}{W})$ and receiving takes $O(W \cdot \frac{n}{W}) = O(n)$, since for worker $w_i$ every other
worker $w_j$ might have an edge into every state in $S_i$. Both are executed in parallel,
so in total the phase runs in $O(\max(\frac{m}{W}, n)) = O(\frac{m}{W} + n)$. In the Refinement
Phase, we assume the run time of computing signatures and their hashes is linear
in the number of edges. Then the loop for computing and hashing ($O(\frac{m}{W})$) and
counting ($O(\frac{n}{W})$) signatures runs in total in $O(\frac{m+n}{W})$, since it is performed by
all workers independently. Each worker receives at most $m/W$ ID-updates each
round and the partition size is computable in $O(W)$, giving the complexity of one
refinement step in $O(\frac{m+n}{W})$. As many as $n$ iterations might be needed for a total
complexity of

$$O\big(\tfrac{m}{W} + n\big) + n \cdot O\big(\tfrac{m+n}{W}\big) = O\big(\tfrac{mn + n^2}{W} + n\big).$$


Remark 4.1. The above analysis assumes that signature interfaces are implemented
with a linear run time in their input bag. This could in fact be theoretically
realized for all basic functors (whence also for their combinations) currently
implemented in CoPaR, which would involve using bucket sort for the grouping of
bag elements by the target block (second component), e.g. for monoid-valued
functors. However, since the table used in bucket sort would be very large (the
size of the last partition) and reducing memory consumption is our main motivation, we
opted for an implementation using a standard $n \log n$ sorting algorithm instead.


Implementation details. CoPaR is implemented in Haskell. We were able 
to reuse, with only minor adjustments, major parts of the code base of CoPaR 
dedicated to the representation and processing of coalgebras. This includes the 
implemented functors and their encodings together with the corresponding parser 
and preprocessing algorithms (see Section 2). As explained in Section 3 the 
sequential Paige-Tarjan-style algorithm of CoPaR was not used; we implemented 
an additional “algorithmic frontend” to our “coalgebraic backend”. To compute
signatures during the Refinement Phase, each functor implements the signature
interface (Definition 3.1), which is written in Haskell as follows:


    class Hashable (Signature f) => SignatureInterface f where
      type Signature f :: Type
      sig :: F1 f -> [(Label f, Int)] -> Signature f


We require in the second line a type Signature f that serves as an implementation-specific
datatype representation of $F\mathbb{N}$. In the type of sig, the types f, Label f
and F1 f correspond to the name of $F$, its label type and the set $F1$, respectively.


Example 4.2. The Haskell implementation of the signature interface for the
finite powerset functor $\mathcal{P}_\omega$ from Example 3.2(2) is as follows:


    data P x = P x                -- already defined in CoPaR
    type instance Label P = ()    -- also already defined
    instance SignatureInterface P where
      type Signature P = Set Int
      sig :: F1 P -> [((), Int)] -> Set Int
      sig _ = setFromList . map snd


Signature interfaces for the other basic functors according to the grammar in (1) 
are implemented similarly. For combined functors CoPaR automatically derives 
their signature interface based on Construction 3.3. 

In the algorithm itself, each worker runs three threads in parallel: The first 
thread is for computing, the second one is for sending and the third one is for 
receiving signatures. This allows us to keep calls to the MPI interface separated 
from (pure) signature computation, simplifying logic and allowing the workers 
to scatter the ID of one state while simultaneously computing the signature of 
the next one to ensure that neither signature computation nor network traffic 
become bottlenecks. For inter-thread communication and synchronization we rely 
on Haskell’s software transactional memory [19] to ease concurrent programming, 
e.g. to avoid race conditions. 


Comparison to Blom and Orzan's algorithm. We now discuss a few 
differences of our algorithm to Blom and Orzan’s original one [8]. 

In Blom and Orzan’s algorithm for LTSs the sets $\mathrm{In}_s$ of $s \in S_i$ are in fact lists
and contain worker $w_k$ a total of $r$ times if there exist $r$ edges from states in $S_k$
to $s$. This induces a redundancy in messages of ID updates, since $w_i$ sends $r$
(instead of one) messages with the ID of $s$ to $w_k$. If the LTS has an average
fanout of $f$ then each worker has $t = n/W \cdot f$ outgoing transitions; this is the
number of ID updates received every round. Since there are only $n$ states, at most
a fraction $n/t = W/f$ of those messages is necessary. In our scenario, we have $W \ll f$ for
large coalgebras, hence the overhead becomes massive; e.g. for $W = 10$, $f = 100$
already 90% of all ID messages are redundant. We use sets instead of lists for $\mathrm{In}_s$
to avoid this redundancy.
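
The redundancy estimate can be spelled out as a short calculation that uses only the quantities named above (number of states $n$, number of workers $W$, average fanout $f$):

```latex
\[
  t = \frac{n}{W} \cdot f, \qquad
  \frac{n}{t} = \frac{n}{\frac{n}{W} \cdot f} = \frac{W}{f}, \qquad
  W = 10,\ f = 100:\quad \frac{W}{f} = 0.1, \quad 1 - \frac{W}{f} = 0.9 .
\]
```

So at most one in ten of the received ID updates carries new information in this example, and the remaining 90% are repetitions.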

Signature computation and communication do not proceed simultaneously in 
Blom and Orzan’s original algorithm. However, in their optimized version [9] and 
in Blom et al.'s algorithm for state-labelled continuous-time Markov chains [4]
they do. 

Another difference of our implementation is that we decided to hash the 
signatures directly on the workers of the respective states while Blom and Orzan 
decided to first send the signatures to some dedicated hashing worker who is 
then (uniquely) responsible for hashing, i.e. computing a new ID. This method 
allows computing new IDs in constant time. However, for more complex functors
supported by CoPaR, sending signatures could result in very large messages, so we 
opted for minimizing network traffic at the cost of slower signature computation. 


5 Evaluation 


To illustrate the practical utility and scalability of the algorithm and its im- 
plementation in CoPaR, we report on a number of benchmarks performed on 
a selection of randomly generated and real world data. In previous evaluations 
of sequential CoPaR [41], we were limited by the 16GB RAM of a standard 
workstation. Here we demonstrate that our distributed implementation fulfills its 


172 F. Birkmann, H.-P. Deifel, S. Milius 


main objective of handling larger systems without lifting the memory restriction 
per process. All benchmarks were run on a high performance computing cluster 
consisting of nodes with two Xeon 2660v2 "Ivy Bridge" chips (10 cores per
chip + SMT) with 2.2GHz clock rate and 64GB RAM. The nodes are connected 
by a fat-tree InfiniBand interconnect fabric with 40 GBit/s bandwidth. Most 
execution runs were performed using 32 workers on 8 nodes, resulting in 4 worker 
processes per node. No process used more than 16GB RAM. Execution times of 
the sequential algorithm were taken using one node of the cluster. No times are 
given for executions that ran out of 16GB memory previously [41]; those were 
not run on the cluster. 


Weighted Tree Automata. In previous work [41], we have determined the size 
of the largest weighted tree automata for different parameters that the sequential 
version of CoPaR could handle in 16GB of RAM. Here, we demonstrate that the 
distributed version can indeed overcome these memory constraints and process 
much larger inputs. 

Recall from Example 2.2 that weighted tree automata are coalgebras for the
functor $FX = M \times M^{(\Sigma X)}$. For these benchmarks, we use $\Sigma X = 4 \times X^r$ with rank
$r \in \{1, \ldots, 5\}$ and the monoids $(2, \vee, 0)$ (available as the finite powerset functor
in CoPaR), $(\mathbb{N}, \max, 0)$ and $(\mathcal{P}_\omega(64), \cup, \emptyset)$. To generate a random automaton
with $n$ states, we uniformly chose $k = 50 \cdot n$ transitions from the set of all possible
transitions (using an efficient sampling algorithm by Vitter [39]), resulting in a
coalgebra encoding with $n' = 51 \cdot n$ states and $m = (r + 1) \cdot k$ edges. We took
care to restrict the state and transition weights to at most 50 different monoid 
elements in each example, to avoid the situation where all states are already 
distinguished in the first iteration of the algorithm. 
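
The size formulas of this paragraph are easy to check against the benchmark parameters. The helper below is a hypothetical convenience function, not part of CoPaR; it returns the number of transitions $k$, the number of states $n'$ of the encoding and its number of edges $m$ for a given rank and state count.

```haskell
-- Sizes for a random WTA of rank r with n states, following the
-- formulas in the text: k = 50 * n transitions, n' = 51 * n states
-- and m = (r + 1) * k edges in the encoded coalgebra.
wtaSizes :: Int -> Int -> (Int, Int, Int)
wtaSizes r n = (k, n + k, (r + 1) * k)
  where
    k = 50 * n
```

For rank 5 and $n = 92615$, `wtaSizes 5 92615` gives `(4630750, 4723365, 27784500)`.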

Table 1 lists results for both the sequential and distributed implementation
when run on the same input. These are the largest WTAs for their respective
rank and monoid that sequential CoPaR could handle using at most 16GB of
RAM [41]. In contrast, the distributed implementation uses less than 1GB per
worker for those examples and is thus able to handle much larger inputs.
Incidentally, the distributed implementation is also faster despite the overhead
incurred by network communication. This can partly be attributed to the
input-parsing stage, which does not need inter-worker synchronization and is
thus perfectly parallelizable.

To test the scaling properties of the distributed algorithm, we ran CoPaR with
the same input WTA but a varying number of worker processes. For this we chose
the WTA for the monoid $(2, \vee, 0)$ with $\Sigma X = 4 \times X^5$ having 86852 states with
4342600 transitions and file size 186MB. The figure on the right above depicts
the maximum memory usage per worker and the overall running time. The results
show that both data points scale nicely with up to 32 workers, but while the
running time even increases when using up to 128 workers, the memory usage per
worker (the main motivation for this work) continues to decrease significantly.


    Monoid            r   k          n         Mem. (MB)   Time (s)   Seq. Time (s)

    (P_ω(64), ∪, ∅)   5   4630750    92615       849          61         511
                      4   4171550    83431       663          52         642
                      3   4721250    94425       639          59         528
                      2   6704100    134082      675          76         471
                      1   7605350    152107      642          79         566
                      3   47212500   944250      6786         675        -

    (ℕ, max, 0)       5   4722550    94451       871          61         445
                      4   4643950    92879       754          56         463
                      3   5039950    100799      628          64         391
                      2   5904200    118084      633          74         403
                      1   7845650    156913      677          82         438
                      3   50399500   1007990     5644         645        -

    (2, ∨, 0)         5   4342600    86852       701          71         537
                      4   4624550    92491       728          67         723
                      3   6710350    134207      825          113        689
                      2   6900000    138000      715          129        467
                      1   7743150    154863      621          160        449
                      3   65000000   1300000     1092         1377       -


Table 1: Maximally manageable WTAs for sequential CoPaR; “Mem.” and “Time”
are the memory and time required for the distributed algorithm and are the
maximum over all workers. “Seq. Time” is the time needed by sequential CoPaR.


PRISM Models. Finally, we show how our distributed partition refinement 
implementation performs on models from the benchmark suite [27] of the PRISM 
model checker [26]. These model (aspects of) real-world protocols and are thus 
a good fit to evaluate how CoPaR performs on inputs that arise in practice. 
Specifically, we use the fms and wlan time bounded families of systems. These 
are continuous time Markov chains, regarded as coalgebras for $FX = \mathbb{R}^{(X)}$, and
Markov decision processes regarded as coalgebras for $FX = \mathbb{N} \times \mathcal{P}_\omega(\mathbb{N} \times \mathcal{D}_\omega X)$,
respectively. Again, our translation to coalgebras took care to force a coarse 
initial partition in the algorithm. 

The results in Table 2 show that the distributed implementation is again able
to handle larger systems than sequential CoPaR in 16GB of RAM per process.
For the fms benchmarks, the distributed implementation is again faster than the
sequential one. However, this is not the case for the wlan examples. The larger
run times might be explained by the much higher number of iterations of the
refinement phase (i-column of the table). This means that only few states are
distinguished in each phase, and thus signatures are re-computed more often and
more network traffic is incurred.

    Model           n         m          Mem. (MB)   Time (s)   i     Seq. Time (s)
    fms (n=4)       35910     237120     13          2          4     4
    fms (n=5)       152712    1111482    62          8          5     17
    fms (n=6)       537768    4205670    163         26         5     68
    fms (n=7)       1639440   13552968   514         84         5     232
    fms (n=8)       4459455   38533968   1690        406        7     -
    wlan tb (K=0)   582327    771088     90          297        306   39
    wlan tb (K=1)   1408676   1963522    147         855        314   105
    wlan tb (K=2)   1632799   5456481    379         2960       374   -


Table 2: Benchmarks on PRISM models: n and m are the numbers of states and
edges of the input coalgebra; i is the number of refinement steps (iterations). The
other columns are analogous to Table 1.


6 Conclusions and Future Work 


We have presented a new and simple partition refinement algorithm in coalgebraic 
genericity which easily lends itself to a distributed implementation. Our algorithm 
is based on König and Küpper's final chain algorithm [25] and Blom and Orzan’s
signature refinement algorithm for labelled transition systems [8]. We have 
provided a distributed implementation in the tool CoPaR. Like the previous 
sequential Paige-Tarjan style partition refinement algorithm, our new algorithm 
is modular in the system type. This is made possible by combining signature 
interfaces by product and coproduct, which is used by CoPaR for handling 
combined type functors. Experimentation has shown that with the distributed 
algorithm CoPaR can handle larger state spaces in general. Run times stay low for 
weighted tree automata, whereas we observed severe penalties on some models 
from the PRISM benchmark suite. 

An additional optimization of the coalgebraic signature refinement algorithm 
should be possible using Blom and Orzan’s idea [9] to mark in each iteration 
those states whose signatures can change in the next iteration and only recompute 
signatures for those states in the next round. This might mitigate the run time 
penalties we have seen in some of the PRISM benchmarks. 

Further work on CoPaR concerns symbolic techniques: we have a prototype 
sequential implementation of the coalgebraic signature refinement algorithm 
where state spaces are represented using BDDs. In a subsequent step it could be 
investigated whether this can be distributed. In another direction the distributed 
algorithm might be extended to compute distinguishing formulas, as recently 
achieved for the sequential algorithm [43], for which there is also an implemented 
prototype. Finally, there is still work required to integrate all these new fea- 
tures, i.e. distribution, distinguishing formulas, reachability and computation of 
minimized systems, into one version of CoPaR. 


Data Availability Statement The software CoPaR and the input files that 
were used to produce the results in this paper are available for download [3]. The 
latest version of CoPaR can be obtained at https://git8.cs.fau.de/software/copar.


Abstract. We present a bounded equivalence verification technique
for higher-order programs with local state. This technique combines 
fully abstract symbolic environmental bisimulations similar to symbolic 
game semantics, novel up-to techniques, and lightweight state invariant 
annotations. This yields an equivalence verification technique with no 
false positives or negatives. The technique is bounded-complete, in that 
all inequivalences are automatically detected given large enough bounds. 
Moreover, several hard equivalences are proved automatically or after 
being annotated with state invariants. We realise the technique in a tool 
prototype called HOBBIT and benchmark it with an extensive set of new 
and existing examples. HOBBIT can prove many classical equivalences 
including all Meyer and Sieber examples. 


Keywords: Contextual equivalence · bounded model checking · symbolic
bisimulation · up-to techniques · operational game semantics.


1 Introduction 


Contextual equivalence is a relation over program expressions which guaran- 
tees that related expressions are interchangeable in any program context. It 
encompasses verification properties like safety and termination. It has attracted 
considerable attention from the semantics community (cf. the 2017 Alonzo Church 
Award), and has found its main applications in the verification of cryptographic 
protocols [4], compiler correctness [26] and regression verification [10,11,9,17]. 
In its full generality, contextual equivalence is hard as it requires reasoning 
about the behaviour of all program contexts, and becomes even more difficult in 
languages with higher-order features (e.g. callbacks) and local state. Advances in 
bisimulations [16,29,3], logical relations [1,13,15] and game semantics [18,25,8,20] 
have offered powerful theoretical techniques for hand-written proofs of contextual 
equivalence in higher-order languages with state. However, these advancements 
have yet to be fully integrated in verification tools for contextual equivalence 
in programming languages, especially in the case of bisimulation techniques. 
Existing tools [12,24,14] only tackle carefully delineated language fragments.

In this paper we aim to push the frontier further by proposing a bounded
model checking technique for contextual equivalence for the entirety of a
higher-order language with local state (Sec. 3). This technique, realised in a tool called
HOBBIT, automatically detects inequivalent program expressions given sufficient
bounds, and proves hard equivalences automatically or semi-automatically. 

Our technique uses a labelled transition system (LTS) for open expressions 
in order to express equivalence as a bisimulation. The LTS is symbolic both for 
higher-order arguments (Sec. 4), similarly to symbolic game models [8,20] and 
derived proof techniques [3,15], and first-order ones (Sec. 6), adopting established 
techniques (e.g. [6]) and tools such as Z3 [23]. This enables the definition of a fully 
abstract symbolic environmental bisimulation, the bounded exploration of which 
is the task of the HOBBIT tool. Full abstraction guarantees that our tool finds all 
inequivalences given sufficient bounds, and only reports true inequivalences. As 
is corroborated by our experiments, this makes HOBBIT a practical inequivalence 
detector, similar to traditional bounded model checking [2] which has been proved 
an effective bug detection technique in industrial-scale C code [6,7,30]. 

However, while proficient in bug finding, bounded model checking can rarely 
prove the absence of errors, and in our setting prove an equivalence: a bound 
is usually reached before all—potentially infinite—program runs are explored. 
Inspired by hand-written equivalence proofs, we address this challenge by propos- 
ing two key technologies: new bisimulation up-to techniques, and lightweight user 
guidance in the form of state invariant annotations. We thereby significantly 
increase the number of equivalences proven by HOBBIT, including for example all 
classical equivalences due to Meyer and Sieber [21]. 

Up-to techniques [28] are specific to bisimulation and concern the reduction 
of the size of bisimulation relations, oftentimes turning infinite transition systems 
into finite ones by focusing on a core part of the relation. Although extensively 
studied in the theory of bisimulation, up-to techniques have not been used in 
practice in an equivalence checker. We specifically propose three novel up-to 
techniques: up to separation and up to re-entry (Sec. 5), dealing with infinity in 
the LTS due to the higher-order nature of the language, and up to state invariants 
(Sec. 7), dealing with infinity due to state updates. Up to separation allows us 
to reduce the knowledge of the context the examined program expressions are 
running in, similar to a frame rule in separation logic. Up to re-entry removes the 
need of exploring unbounded nestings of higher-order function calls under specific 
conditions. Up to state invariants allows us to abstract parts of the state and 
make finite the number of explored configurations by introducing state invariant 
predicates in configurations. 

State invariants are common in equivalence proofs of stateful programs, both 
in handwritten (e.g. [16]) and tool-based proofs. In the latter they are expressed 
manually in annotations (e.g. [9]) or automatically inferred (e.g. [14]). In HOBBIT 
we follow the manual approach, leaving heuristics for automatic invariant inference 
for future work. An important feature of our annotations is the ability to express 
relations between the states of the two compared terms, enabled by the up to 


³ Higher Order Bounded BIsimulation Tool (HOBBIT), https://github.com/LaifsV1/Hobbit. 
state invariants technique. This leads to finite bisimulation transition systems in 
examples where concrete value semantics are infinite state. 

The above technologies, combined with standard up-to techniques, transform 
HOBBIT from a bounded checker into an equivalence prover able to reason about 
infinite behaviour in a finite manner in a range of examples, including classical 
example equivalences (e.g. all in [21]) and some that previous work on up-to 
techniques cannot algorithmically decide [3] (cf. Ex. 22). We have benchmarked 
HOBBIT on examples from the literature and newly designed ones (Sec. 8). Due 
to the undecidable nature of contextual equivalence, up-to techniques are not 
exhaustive: no set of up-to techniques is guaranteed to finitise all examples. 
Indeed there are a number of examples where the bisimulation transition system 
is still infinite and HOBBIT reaches the exploration bound. For instance, HOBBIT 
is not able to prove examples with inner recursion and well-bracketing properties, 
which we leave to future work. Nevertheless, our approach provides a contextual 
equivalence tool for a higher-order language with state that can prove many 
equivalences and inequivalences which previous work could not handle due to 
syntactic restrictions and other limitations (Sec. 9). 


Related work Our paper marries techniques from environmental bisimulations 
up-to [16,29,28,3] with the work on fully abstract game models for higher-order 
languages with state [18,8,20]. The closest to our technique is that of Biernacki et 
al. [3], which introduces up-to techniques for a similar symbolic LTS to ours, albeit 
with symbolic values restricted to higher-order types, resulting in infinite LTSs in 
examples such as Ex. 21, and with inequivalence decided outside the bisimulation 
by (non-)termination, precluding the use of up-to techniques in examples such as 
Ex. 22. Close in spirit is the line of research on logical relations [1,13,15] which 
provides a powerful tool for hand-written proofs of contextual equivalence. Also 
related are the tools HECTOR [12] and CONEQCT [24], and SYTECI [14], based 
on game semantics and step-indexed logical relations respectively (cf. Sec. 9). 


2 High-Level Intuitions 


Contextual equivalence requires that two program expressions lead to the same 
observable result in any program context these may be fed in. Instead of working 
directly with this definition, we can translate programs into a semantic model 
that is fully abstract, reducing contextual equivalence to semantic equality. 

The semantic model we use is that of Game Semantics [18]. We model programs 
as formal interactions between two players: a Proponent (corresponding to the 
program) and an Opponent (standing for any program context). Concretely, these 
interactions are sets of traces produced from a Labelled Transition System (LTS), 
the nodes and labels of which are called configurations and moves respectively. 
The LTS captures the interaction of the program with its environment, which 
is realised via function applications and returns: moves can be questions (i.e. 
function applications) or answers (returns), and belong to proponent or opponent. 
E.g. a program calling an external function will issue a proponent question, while 
the return of the external function will be an opponent answer. In the examples 
that follow, moves that correspond to the opponent shall be underlined. 




[Figure 1: three LTS diagrams (top, middle, bottom) with app and ret transitions; graphic not recoverable from extraction.] 


Fig. 1. Sample LTSs modelling expressions in Section 2. 


Example 1. Consider the expression N = (fun f -> f (); 0) of type 
$(\mathsf{unit} \to \mathsf{unit}) \to \mathsf{int}$. Evaluating N leads to a function g being returned 
(i.e. g is $\lambda f.\, f\,();\, 0$). When g is called with some input $f_1$, it will always return 0 
but in the process it may call the external function $f_1$. The call to $f_1$ may 
immediately return or it may call g again (i.e. reenter), and so on. The LTS for N 
is as in Fig. 1 (top). 


Given two expressions M, N, checking their equivalence will amount to check- 
ing bisimulation equivalence of their (generally infinite) LTS's. Our checking 
routine performs a bounded analysis that aims to either find a finite counterex- 
ample and thus prove inequivalence, or build a bisimulation relation that shows 
the equivalence of the expressions. The former case is easier as it is relatively 
rapid to explore a bisimulation graph up to a given depth. The latter one is 
harder, as the target bisimulation can be infinite. To tackle part of this infinity, 
we use three novel up-to techniques for environmental bisimulation. 
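
The three possible outcomes of such a bounded analysis can be illustrated with a minimal Python sketch. It is not HOBBIT's algorithm: it assumes deterministic LTSs given as successor functions (the hypothetical `step1` and `step2`), checks strong rather than weak bisimulation, and uses no up-to techniques. The name `bounded_bisim` and the result strings are illustrative choices only.

```python
from collections import deque

def bounded_bisim(step1, init1, step2, init2, bound):
    """Breadth-first exploration of pairs of states, up to `bound` transitions.

    Assumptions of this sketch: each step function maps a state to a dict
    {label: successor}, so both LTSs are deterministic, and matching is
    strong (label for label).
    """
    seen = {(init1, init2)}
    queue = deque([(init1, init2, 0)])
    hit_bound = False
    while queue:
        c1, c2, depth = queue.popleft()
        moves1, moves2 = step1(c1), step2(c2)
        if set(moves1) != set(moves2):
            return "inequivalent"       # a move of one side cannot be matched
        for label, next1 in moves1.items():
            pair = (next1, moves2[label])
            if pair in seen:
                continue                # revisited pair: a cycle is closed
            if depth == bound:
                hit_bound = True        # unexplored pair beyond the bound
            else:
                seen.add(pair)
                queue.append((pair[0], pair[1], depth + 1))
    return "bound reached" if hit_bound else "equivalent"

# A two-state loop and its four-state unrolling: finitely many pairs.
loop = {0: {"a": 1}, 1: {"b": 0}}
unrolled = {0: {"a": 1}, 1: {"b": 2}, 2: {"a": 3}, 3: {"b": 0}}

def counter(n):                 # infinite-state: never revisits a state
    return {"tick": n + 1}

def stops_at_3(n):              # agrees with counter for three steps only
    return {"tick": n + 1} if n < 3 else {}

print(bounded_bisim(loop.get, 0, unrolled.get, 0, bound=10))
print(bounded_bisim(counter, 0, counter, 0, bound=5))
print(bounded_bisim(counter, 0, stops_at_3, 0, bound=5))
print(bounded_bisim(counter, 0, stops_at_3, 0, bound=2))
```

The loop and its unrolling close a cycle and are reported equivalent. The two infinite counters are equivalent, but the sketch can only report that the bound was reached. The difference between `counter` and `stops_at_3` is found only when the bound is at least 3. The second case is the one that up-to techniques aim to turn into a finite proof.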

Up-to techniques roughly assert that if a core set of configurations in the 
bisimulation graph explored can be proven to be part of a relation satisfying a 
definition that is more permissive than standard bisimulation, then a superset 
of configurations forms a proper bisimulation relation. This has the implication 
that a bounded analysis can be used to explore a finite part of the bisimulation 
graph to verify potentially infinitely many configurations. As there can be no 
complete set of up-to techniques, the pertaining question is how useful they are 
in practice. In the remainder of this section we present the first of our up-to 
techniques, called up to separation, via an example equivalence. The intuition 
behind this technique comes from Separation Logic and amounts to saying that 
functions that access separate regions of the state can be explored independently. 
As a corollary, a function that manipulates only its own local references may be 
explored independently of itself, i.e. it suffices to call it once. 
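\[
\begin{array}{lrcl}
\mathsf{Loc}: & l, k & & \qquad \mathsf{Var}:\ x, y, z \qquad \mathsf{Const}:\ c \\
\mathsf{Type}: & T & ::= & \mathsf{bool} \mid \mathsf{int} \mid \mathsf{unit} \mid T \to T \mid T_1 * \dots * T_n \\
\mathsf{Exp}: & e, M, N & ::= & v \mid (\vec{e}) \mid \mathit{op}(\vec{e}) \mid e\,e \mid \mathsf{if}\ e\ \mathsf{then}\ e\ \mathsf{else}\ e \mid \mathsf{ref}\ l = v\ \mathsf{in}\ e \mid {!l} \mid l := e \mid \mathsf{let}\ (\vec{x}) = e\ \mathsf{in}\ e \\
\mathsf{Val}: & u, v & ::= & c \mid x \mid \mathsf{fix} f(x).e \mid (\vec{v}) \\
\mathsf{ECxt}: & E & ::= & [\cdot]_T \mid (\vec{v}, E, \vec{e}) \mid \mathit{op}(\vec{v}, E, \vec{e}) \mid E\,e \mid v\,E \mid l := E \mid \mathsf{if}\ E\ \mathsf{then}\ e\ \mathsf{else}\ e \mid \mathsf{let}\ (\vec{x}) = E\ \mathsf{in}\ e \\
\mathsf{Cxt}: & D & ::= & [\cdot]_{i,T} \mid e \mid (\vec{D}) \mid \mathit{op}(\vec{D}) \mid D\,D \mid l := D \mid \mathsf{if}\ D\ \mathsf{then}\ D\ \mathsf{else}\ D \mid \mathsf{fix} f(x).D \\
 & & \mid & \mathsf{ref}\ l = D\ \mathsf{in}\ D \mid \mathsf{let}\ (\vec{x}) = D\ \mathsf{in}\ D
\end{array}
\]

\[
\begin{array}{rcll}
\langle s ; \mathit{op}(\vec{c})\rangle & \hookrightarrow & \langle s ; w\rangle & \text{if } \mathit{op}^{\mathsf{arith}}(\vec{c}) = w \\
\langle s ; (\mathsf{fix} f(x).e)\, v\rangle & \hookrightarrow & \langle s ; e[v/x][\mathsf{fix} f(x).e / f]\rangle & \\
\langle s ; \mathsf{let}\ (\vec{x}) = (\vec{v})\ \mathsf{in}\ e\rangle & \hookrightarrow & \langle s ; e[\vec{v}/\vec{x}]\rangle & \\
\langle s ; \mathsf{ref}\ l = v\ \mathsf{in}\ e\rangle & \hookrightarrow & \langle s[l \mapsto v] ; e\rangle & \text{if } l \notin \mathsf{dom}(s) \\
\langle s ; {!l}\rangle & \hookrightarrow & \langle s ; v\rangle & \text{if } s(l) = v \\
\langle s ; l := v\rangle & \hookrightarrow & \langle s[l \mapsto v] ; ()\rangle & \\
\langle s ; \mathsf{if}\ c\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2\rangle & \hookrightarrow & \langle s ; e_i\rangle & \text{if } (c, i) \in \{(\mathsf{tt}, 1), (\mathsf{ff}, 2)\} \\
\langle s ; E[e]\rangle & \to & \langle s' ; E[e']\rangle & \text{if } \langle s ; e\rangle \hookrightarrow \langle s' ; e'\rangle
\end{array}
\]


Fig. 2. Syntax and reduction semantics of the language $\lambda^{\mathsf{imp}}$. 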


Example 2. Consider M = (fun f -> ref x = 0 in f (); !x) and N from 
Ex. 1. The LTS corresponding to M and N are shown in Fig. 1 (middle and 
top). Regarding M, we can see that opponent is always allowed to reenter the 
proponent function g, which creates a new reference $x_n$ each time. This makes 
each configuration unique, which prevents us from finding cycles and thus finitising 
the bisimulation graph. Moreover, the LTSs for both M and N are infinite because 
of the stack discipline they need to adhere to when O issues reentrant calls. 

With separation, however, we could prune the two LTSs as in Fig. 1 (bottom). 
We denote the configurations after the first opponent call as $C_1$. Any opponent 
call after $C_1$ leads to a configuration which differs from $C_1$ either by a state 
component that is not accessible anymore and can thus be separated, or by a 
stack component that can be similarly separated. Hence, the LTS's that we need 
to consider are finite and thus the expressions are proven equivalent. 
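
The behaviour described in Example 2 can be mimicked with ordinary closures. The following Python sketch is an informal model, not part of the technique: `M` and `N` transcribe the two expressions, a one-element list stands in for the local reference, and the hypothetical helper `run_reentrant` plays the role of one particular opponent that re-enters the function a fixed number of times. Running a few contexts illustrates the claim but proves nothing about all contexts.

```python
def M(f):
    x = [0]          # fresh one-cell reference, local to this call
    f(())
    return x[0]

def N(f):
    f(())
    return 0

def run_reentrant(g, reentries):
    """Call g with a callback that calls g again, nested `reentries` times."""
    results = []
    remaining = [reentries]

    def f(_unit):
        if remaining[0] > 0:
            remaining[0] -= 1
            results.append(g(f))   # opponent re-enters the proponent function

    results.append(g(f))
    return results

print(run_reentrant(M, 3) == run_reentrant(N, 3))
```

Each nested call of `M` allocates its own cell, which no other call can reach. This is the informal reason why the extra state created by re-entrant calls can be separated away.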


3 Language and Semantics 


We develop our technique for the language $\lambda^{\mathsf{imp}}$, a simply typed lambda calculus 
with local state whose syntax and reduction semantics are shown in Fig. 2. 
Expressions (Exp) include the standard lambda expressions with recursive functions 
($\mathsf{fix} f(x).e$), together with location creation ($\mathsf{ref}\ l = v\ \mathsf{in}\ e$), dereferencing ($!l$), and 
assignment ($l := e$), as well as standard base type constants ($c$) and operations 
($\mathit{op}(\vec{e})$). Locations are mapped to values, including function values, in a store (St). 
We write $\cdot$ for the empty store and let $\mathsf{fl}(\chi)$ denote the set of free locations in $\chi$. 

The language $\lambda^{\mathsf{imp}}$ is simply-typed with typing judgements of the form 
$\Delta; \Sigma \vdash e : T$, where $\Delta$ is a type environment (omitted when empty), $\Sigma$ a store 
typing and $T$ a value type (Type); $\Sigma_s$ is the typing of store $s$. The rules of the 
type system are standard and omitted here. Values consist of boolean, integer, and 
unit constants, functions and arbitrary length tuples of values. To keep the 
presentation of our 
technique simple we do not include reference types as value types, effectively 
keeping all locations local. Exchange of locations between expressions can be 
encoded using get and set functions. In Ex. 22 we show the encoding of a classic 
equivalence with location exchange between expressions and their context. Future 
extensions of our technique to handle location types can be informed by 
previous work [18,14]. 

The reduction semantics is by small-step transitions between configurations 
containing a store and an expression, $\langle s ; e\rangle \to \langle s' ; e'\rangle$, defined using single-hole 
evaluation contexts (ECxt) over a base relation $\hookrightarrow$. Holes $[\cdot]_T$ are annotated with 
the type $T$ of closed values they accept, which we may omit to lighten notation. 
Beta substitution of $x$ with $v$ in $e$ is written as $e[v/x]$. We write $\langle s ; e\rangle \Downarrow$ to denote 
$\langle s ; e\rangle \to^{*} \langle t ; v\rangle$ for some $t, v$. We write $\vec{\chi}$ to mean a syntactic sequence, and 
assume standard syntactic sugar from the lambda calculus. In our examples we 
assume an ML-like syntax and implementation of the type system, which is also 
the concrete syntax of HOBBIT. 

We consider environments $\Gamma \in \mathbb{N} \rightharpoonup_{\mathsf{fin}} \mathsf{Val}$ which map natural numbers to 
closed values. The concatenation of two such environments $\Gamma_1$ and $\Gamma_2$, written 
$\Gamma_1, \Gamma_2$, is defined when $\mathsf{dom}(\Gamma_1) \cap \mathsf{dom}(\Gamma_2) = \emptyset$. We write $({}^{i_1}v_1, \dots, {}^{i_n}v_n)$ for a 
concrete environment mapping $i_1, \dots, i_n$ to $v_1, \dots, v_n$, respectively. When indices 
are unimportant we omit them and treat $\Gamma$ environments as lists. 

General contexts $D$ contain multiple, non-uniquely indexed holes $[\cdot]_{i,T}$, where 
$T$ is the type of value that can replace the hole. Notation $D[\Gamma]$ denotes the 
context $D$ with each hole $[\cdot]_{i,T}$ replaced with $\Gamma(i)$, provided that $i \in \mathsf{dom}(\Gamma)$ and 
$\Sigma \vdash \Gamma(i) : T$, for some $\Sigma$. We omit hole types where possible and indices when all 
holes in $D$ are annotated with the same $i$. In the latter case we write $D[v]$ instead 
of $D[({}^{i}v)]$ and allow replacing all holes of $D$ with a closed expression $e$, written 
$D[e]$. We assume the Barendregt convention for locations, thus replacing context 
holes avoids location capture. Standard contextual equivalence [22] follows. 


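Definition 3 (Contextual Equivalence). Expressions $\vdash e_1 : T$ and $\vdash e_2 : T$ 
are contextually equivalent, written as $e_1 \equiv e_2$, when for all contexts $D$ such that 
$\vdash D[e_1] : \mathsf{unit}$ and $\vdash D[e_2] : \mathsf{unit}$ we have $\langle \cdot ; D[e_1]\rangle \Downarrow$ iff $\langle \cdot ; D[e_2]\rangle \Downarrow$. 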


4 LTS with Symbolic Higher-Order Transitions 


Our Labelled Transition System (LTS) has symbolic transitions for both higher- 
order and first-order transitions. For simplicity we first present our LTS with 
symbolic higher-order and concrete first-order transitions. We develop our theory 
and most up-to techniques on this simpler LTS. We then show its extension with 
symbolic first-order transitions and develop up to state invariants which relies on 
this extension. We extend the syntax with abstract function names $\alpha$: 


\[ \mathsf{Val}: \quad u, v, w ::= c \mid \mathsf{fix} f(x).e \mid (\vec{v}) \mid \alpha_T \]


Abstract function names $\alpha$ are annotated with the type $T$ of function they 
represent, omitted where possible; $\mathsf{an}(\chi)$ is the set of abstract names in $\chi$. 
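\[
\begin{array}{ll}
\textsc{PropApp}: & \langle A ; \Gamma ; K ; s ; E[\alpha\, v]\rangle \xrightarrow{\mathsf{app}(\alpha, D)} \langle A ; \Gamma, \Gamma' ; E[\cdot], K ; s ; \cdot\rangle \quad \text{if } (D, \Gamma') \in \mathsf{ulpatt}(v) \\[4pt]
\textsc{PropRet}: & \langle A ; \Gamma ; K ; s ; v\rangle \xrightarrow{\mathsf{ret}(D)} \langle A ; \Gamma, \Gamma' ; K ; s ; \cdot\rangle \quad \text{if } (D, \Gamma') \in \mathsf{ulpatt}(v) \\[4pt]
\textsc{OpApp}: & \langle A ; \Gamma ; K ; s ; \cdot\rangle \xrightarrow{\mathsf{app}(i, D[\vec{\alpha}])} \langle A \uplus \vec{\alpha} ; \Gamma ; K ; s ; e\rangle \quad \text{if } \Sigma_s \vdash \Gamma(i) : T \to T' \\
 & \qquad \text{and } (D, \vec{\alpha}) \in \mathsf{ulpatt}(T) \text{ and } \Gamma(i)\, D[\vec{\alpha}] \succ e \\[4pt]
\textsc{OpRet}: & \langle A ; \Gamma ; E[\cdot]_T, K ; s ; \cdot\rangle \xrightarrow{\mathsf{ret}(D[\vec{\alpha}])} \langle A \uplus \vec{\alpha} ; \Gamma ; K ; s ; E[D[\vec{\alpha}]]\rangle \quad \text{if } (D, \vec{\alpha}) \in \mathsf{ulpatt}(T) \\[4pt]
\textsc{Tau}: & \langle A ; \Gamma ; K ; s ; e\rangle \xrightarrow{\tau} \langle A ; \Gamma ; K ; s' ; e'\rangle \quad \text{if } \langle s ; e\rangle \to \langle s' ; e'\rangle \\[4pt]
\textsc{Response}: & C \xrightarrow{\eta} \langle\bot\rangle \quad \text{if } \eta \neq {\downarrow} \\[4pt]
\textsc{Term}: & \langle A ; \Gamma ; \cdot ; s ; \cdot\rangle \xrightarrow{\downarrow} \langle\bot\rangle
\end{array}
\]


Fig. 3. The Labelled Transition System. 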


We define our LTS (shown in Fig. 3) by opponent and proponent call and 
return transitions, based on Game Semantics [18]. Proponent transitions are 
the moves of an expression interacting with its context. Opponent transitions 
are the moves of the context surrounding this expression. These transitions are 
over proponent and opponent configurations $\langle A ; \Gamma ; K ; s ; e\rangle$ and $\langle A ; \Gamma ; K ; s ; \cdot\rangle$, 
respectively. In these configurations: 

- $A$ is the set of abstract function names used so far in the interaction; 
- $\Gamma$ is an environment indexing proponent functions known to opponent;⁴ 
- $K$ is a stack of proponent continuations, created by nested proponent calls; 
- $s$ is the store containing proponent locations; 
- $e$ is the expression reduced in proponent configurations; $\hat{e}$ denotes $e$ or $\cdot$. 


In addition, we introduce a special configuration $\langle\bot\rangle$ which is used in order to 
represent expressions that cannot perform given transitions (cf. Remark 6). We 
let a trace be a sequence of app and ret moves (i.e. labels), as defined in Fig. 3. 
For the LTS to provide a fully abstract model of the language, it is necessary 
that functions which are passed as arguments or return values from proponent to 
opponent be abstracted away, as the actual syntax of functions is not directly 
observable in $\lambda^{\mathsf{imp}}$. This is achieved by deconstructing such values $v$ to: 


- an ultimate pattern $D$ (cf. [19]), which is a context obtained from $v$ by 
  replacing each function in $v$ with a distinct numbered hole; together with 
- an environment $\Gamma$ mapping indices of these holes to values, and $D[\Gamma] = v$. 


We let $\mathsf{ulpatt}(v)$ contain all such pairs $(D, \Gamma)$ for $v$; e.g.: 
$\mathsf{ulpatt}((\lambda x.e_1, 5)) = \{(([\cdot]_i, 5), ({}^{i}\lambda x.e_1)) \mid \text{for any } i\}$. We extend ulpatt to types 
through the use of symbolic function names: $\mathsf{ulpatt}(T)$ is the largest set of pairs 
$(D, \Gamma)$ such that $\vdash D[\Gamma] : T$, where $\mathsf{rng}(\Gamma) = \vec{\alpha}$, and $D$ does not contain functions. 
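
The decomposition of a value into an ultimate pattern and an environment can be sketched in Python. This is an illustrative model under stated assumptions: values are nested tuples of base constants and Python callables, the class `Hole` and the functions `ulpatt` and `plug` are hypothetical names, and the sketch returns a single pair with consecutive indices, whereas the definition above collects the pairs for every choice of indices.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hole:
    index: int                     # a numbered hole standing for a function

def ulpatt(value, start=0):
    """Return one (pattern, environment) pair for `value`."""
    env = {}

    def go(v):
        if callable(v):            # functions are hidden behind fresh holes
            i = start + len(env)
            env[i] = v
            return Hole(i)
        if isinstance(v, tuple):   # tuples are traversed structurally
            return tuple(go(u) for u in v)
        return v                   # base constants stay in the pattern

    return go(value), env

def plug(pattern, env):
    """Fill the holes of `pattern` from `env`, recovering the value."""
    if isinstance(pattern, Hole):
        return env[pattern.index]
    if isinstance(pattern, tuple):
        return tuple(plug(p, env) for p in pattern)
    return pattern

f = lambda x: x
pattern, env = ulpatt((f, 5))
print(pattern, plug(pattern, env) == (f, 5))
```

Only the pattern would appear on a transition label. The function itself stays in the environment and is reachable by the opponent solely through its index.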


⁴ Thus, $\Gamma$ encodes the environment of Environmental Bisimulations (e.g. [16]). 
In Fig. 3, proponent application and return transitions (PROPAPP, PROPRET) 
use ultimate pattern matching for values and accumulate the functions generated 
by the proponent in the $\Gamma$ environment of the configuration, leaving only their 
indices on the label of the transition itself. Opponent application and return 
transitions (OPAPP, OPRET) use ultimate pattern matching for types to generate 
opponent-generated values which can only contain abstract functions. This elimi- 
nates the need for quantifying over all functions in opponent transitions but still 
includes infinite quantification over all base values. Symbolic first-order values in 
Sec. 6 will obviate the latter. 

At opponent application the following preorder performs a beta reduction when 
opponent applies a concrete function. This technicality is needed for soundness. 


Definition 4 ($\succ$). For application $v\,u$ we write $v\,u \succ e$ to mean $e = \alpha\,u$, when 
$v = \alpha$; and $e = e'[u/x][\mathsf{fix} f(x).e' / f]$, when $v = \mathsf{fix} f(x).e'$. 

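In our LTS, $C$ ranges over configurations and $\eta$ over transition labels; $\xRightarrow{\eta}$ means 
$\Rightarrow$ when $\eta = \tau$, and $\Rightarrow \xrightarrow{\eta} \Rightarrow$ otherwise, where $\Rightarrow$ is the reflexive transitive 
closure of $\xrightarrow{\tau}$. Standard weak (bi-)simulation follows. 


Definition 5 (Weak Bisimulation). Binary relation $\mathcal{R}$ is a weak simulation 
when for all $C_1 \mathrel{\mathcal{R}} C_2$ and $C_1 \xrightarrow{\eta} C_1'$, there exists $C_2'$ such that $C_2 \xRightarrow{\eta} C_2'$ and 
$C_1' \mathrel{\mathcal{R}} C_2'$. If $\mathcal{R}$, $\mathcal{R}^{-1}$ are weak simulations then $\mathcal{R}$ is a weak bisimulation. 
Similarity ($\lessapprox$) and bisimilarity ($\approx$) are the largest weak simulation and 
bisimulation, respectively. 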
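
Weak transitions absorb silent steps before and after a visible move. A minimal Python sketch of this standard construction follows, assuming a finite LTS encoded as a dictionary from states to lists of `(label, successor)` pairs and the string `"tau"` for the silent label. The encoding and the names `tau_closure` and `weak_successors` are assumptions made for illustration.

```python
def tau_closure(lts, states):
    """All states reachable from `states` by zero or more tau steps."""
    closure, stack = set(states), list(states)
    while stack:
        c = stack.pop()
        for label, nxt in lts.get(c, []):
            if label == "tau" and nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def weak_successors(lts, state, label):
    """States reachable by tau steps, then `label`, then tau steps.

    For label "tau" this is just the tau closure of `state`.
    """
    before = tau_closure(lts, {state})
    if label == "tau":
        return before
    after = {nxt for c in before for (l, nxt) in lts.get(c, []) if l == label}
    return tau_closure(lts, after)

lts = {"c0": [("tau", "c1")], "c1": [("app", "c2")], "c2": [("tau", "c3")]}
print(sorted(weak_successors(lts, "c0", "app")))
```

A weak simulation check answers each single step of one side with any state in the `weak_successors` set of the other side.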


Remark 6. Any proponent configuration that cannot match a standard bisimulation 
transition challenge can trivially respond to the challenge by transitioning 
into $\langle\bot\rangle$ by the RESPONSE rule in Fig. 3. By the same rule, this configuration can 
trivially perform all transitions except a special termination transition, labelled 
with $\downarrow$. However, regular configurations that have no pending proponent calls 
($K = \cdot$) can perform the special termination transition (TERM rule), signalling 
the end of a complete trace, i.e. a completed computation. This mechanism 
allows us to encode complete trace equivalence, which coincides with contextual 
equivalence [18], as bisimulation equivalence. In a bisimulation proof, if a 
proponent configuration is unable to match a bisimulation transition with a regular 
transition, it can still transition to $\langle\bot\rangle$ where it can simulate every transition of 
the other expression, apart from $\downarrow$ leading to a complete trace. 

Our mechanism for treating unmatched transitions has the benefit of enabling 
us to use the standard definition of bisimulation over our LTS. This is in contrast 
to previous work [3,15], where termination/non-termination needed to be proven 
independently or baked in the simulation conditions. More importantly, our 
approach allows us to use bisimulation up-to techniques even when one of the 
related configurations diverges, which is not possible in previous symbolic LTSs 
[18,15,3], and is necessary in examples such as Ex. 22. 


Definition 7 (Bisimilar Expressions). Expressions $\vdash e_1 : T$ and $\vdash e_2 : T$ are 
bisimilar, written $e_1 \approx e_2$, when $\langle \cdot ; \cdot ; \cdot ; \cdot ; e_1\rangle \approx \langle \cdot ; \cdot ; \cdot ; \cdot ; e_2\rangle$. 

Theorem 8 (Soundness and Completeness). $e_1 \approx e_2$ iff $e_1 \equiv e_2$. 


As a final remark, the LTS presented in this section is finite state only for a small 
number of trivial equivalence examples. The following section addresses sources 
of infinity in the transition systems through bisimulation up-to techniques. 


5 Up-to Techniques 

We start by the definition of a sound up-to technique. 

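Definition 9 (Weak Bisimulation up to $f$). $\mathcal{R}$ is a weak simulation up to $f$ 
when for all $C_1 \mathrel{\mathcal{R}} C_2$ and $C_1 \xrightarrow{\eta} C_1'$, there is $C_2'$ with $C_2 \xRightarrow{\eta} C_2'$ and $C_1' \mathrel{f(\mathcal{R})} C_2'$. 
If $\mathcal{R}$, $\mathcal{R}^{-1}$ are weak simulations up to $f$ then $\mathcal{R}$ is a weak bisimulation up to $f$. 


Definition 10 (Sound up-to technique). A function $f$ is a sound up-to 
technique when for any $\mathcal{R}$ which is a simulation up to $f$ we have $\mathcal{R} \subseteq (\lessapprox)$. 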


HOBBIT employs the standard techniques: up to identity, up to garbage 
collection, up to beta reductions and up to name permutations. Here we present 
two novel up-to techniques: up to separation and up to reentry. 


Up to Separation Our experience with HOBBIT has shown that one of the 
most effective up-to techniques for finitising bisimulation transition systems is 
the novel up to separation which we propose here. The intuition of this technique 
is that if different functions operate on disjoint parts of the store, they can be 
explored in disjoint parts of the bisimulation transition system. Taken to the 
extreme, a function that does not contain free locations can be applied only 
once in a bisimulation test as two copies of the function will not interfere with 
each other, even if they allocate new locations after application. To define up to 
separation we need to define a separating conjunction for configurations. 


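Definition 11 (Stack Interleaving). Let $K_1$, $K_2$ be lists of evaluation contexts 
from ECxt (Fig. 2); we define the interleaving operation $K_1 \#_{\vec{k}} K_2$ inductively, 
and write $K_1 \# K_2$ to mean $K_1 \#_{\vec{k}} K_2$ for unspecified $\vec{k}$. We let $\cdot \mathbin{\#_{\cdot}} \cdot = \cdot$ and: 

\[
E_1, K_1 \mathbin{\#_{1,\vec{k}}} K_2 = E_1, (K_1 \mathbin{\#_{\vec{k}}} K_2)
\qquad
K_1 \mathbin{\#_{2,\vec{k}}} E_2, K_2 = E_2, (K_1 \mathbin{\#_{\vec{k}}} K_2).
\]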
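
Stack interleaving can be made concrete with a short Python sketch. It reads the index of the interleaving operation as a sequence of 1s and 2s recording which stack supplies the next frame. That reading, the list encoding of stacks and the names `interleave` and `all_interleavings` are assumptions of the sketch.

```python
def interleave(k1, k2, ks):
    """Merge stacks k1 and k2; ks[n] says which stack supplies frame n."""
    out, i, j = [], 0, 0
    for k in ks:
        if k == 1 and i < len(k1):
            out.append(k1[i])
            i += 1
        elif k == 2 and j < len(k2):
            out.append(k2[j])
            j += 1
        else:
            raise ValueError("index sequence does not match the stacks")
    if i != len(k1) or j != len(k2):
        raise ValueError("index sequence does not consume both stacks")
    return out

def all_interleavings(k1, k2):
    """Every merge of k1 and k2 that preserves the order within each stack."""
    if not k1 and not k2:
        return [[]]
    result = []
    if k1:
        result += [[k1[0]] + rest for rest in all_interleavings(k1[1:], k2)]
    if k2:
        result += [[k2[0]] + rest for rest in all_interleavings(k1, k2[1:])]
    return result

print(interleave(["E1", "E2"], ["F1"], [1, 2, 1]))
print(len(all_interleavings(["E1", "E2"], ["F1"])))
```

Writing the operation without an index corresponds to picking any element of `all_interleavings`. The relative order of frames from the same stack is always preserved.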


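Definition 12 (Separating Conjunction). Let $C_1 = \langle A_1 ; \Gamma_1 ; K_1 ; s_1 ; \hat{e}_1\rangle$ and 
$C_2 = \langle A_2 ; \Gamma_2 ; K_2 ; s_2 ; \hat{e}_2\rangle$ be well-formed configurations. We define: 

- $C_1 \otimes^{1}_{\vec{k}} C_2 \stackrel{\mathrm{def}}{=} \langle A_1 \cup A_2 ; \Gamma_1, \Gamma_2 ; K_1 \mathbin{\#_{\vec{k}}} K_2 ; s_1, s_2 ; \hat{e}_1\rangle$ when $\hat{e}_2 = \cdot$ 
- $C_1 \otimes^{2}_{\vec{k}} C_2 \stackrel{\mathrm{def}}{=} \langle A_1 \cup A_2 ; \Gamma_1, \Gamma_2 ; K_1 \mathbin{\#_{\vec{k}}} K_2 ; s_1, s_2 ; \hat{e}_2\rangle$ when $\hat{e}_1 = \cdot$ 

provided $\mathsf{dom}(s_1) \cap \mathsf{dom}(s_2) = \emptyset$. We let $C_1 \otimes C_2$ denote $\exists i, \vec{k}.\ C_1 \otimes^{i}_{\vec{k}} C_2$. 
The function sep provides the up to separation technique; it is defined as: 

\[
\textsc{UpTo}\otimes\ \dfrac{C_1 \mathrel{\mathcal{R}} C_2 \qquad C_3 \mathrel{\mathcal{R}} C_4}{C_1 \otimes^{i}_{\vec{k}} C_3 \mathrel{\mathsf{sep}(\mathcal{R})} C_2 \otimes^{i}_{\vec{k}} C_4}
\qquad
\textsc{UpTo}\otimes\bot_L\ \dfrac{C_1 \mathrel{\mathcal{R}} \langle\bot\rangle \qquad C_3 \mathrel{\mathcal{R}} C_4}{C_1 \otimes C_3 \mathrel{\mathsf{sep}(\mathcal{R})} \langle\bot\rangle}
\qquad
\textsc{UpTo}\otimes\bot_R\ \dfrac{C_1 \mathrel{\mathcal{R}} C_2 \qquad C_3 \mathrel{\mathcal{R}} \langle\bot\rangle}{C_1 \otimes C_3 \mathrel{\mathsf{sep}(\mathcal{R})} \langle\bot\rangle}
\]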


Soundness follows by extending [28,27] with a weaker, sufficient proof obligation. 
Lemma 13. Function sep is a sound up-to technique. 


Many example equivalences have a finite transition system when using up to 
separation in conjunction with the simple techniques of the preceding section. 


Example 14. The following is a classic example equivalence from Meyer and 
Sieber [21]. The two expressions are equivalent at type $(\mathsf{unit} \to \mathsf{unit}) \to \mathsf{unit}$. 


M = fun f -> ref x = 0 in f () N = fun f -> f () 


For both functions, after initial application of the function by the opponent, 
the proponent calls f, growing the stack K in the two configurations. At that 
point the opponent can apply the same functions again. The LTS of both M 
and N is thus infinite because K can grow indefinitely, and so is a bisimulation 
proving this equivalence. It is additionally infinite because the opponent can keep 
applying the initial function applications even after these return. However, if 
we apply the up-to separation technique immediately after the first opponent 
application, the $\Gamma$ environments become empty, and thus no second application of 
the same functions can happen. The LTS thus becomes trivially small. Note that 
no other up to technique is needed here. HOBBIT applies up-to separation after 
every opponent application transition and explores the configuration containing 
the application expression and the smallest possible $\Gamma$; this does not lead to 
false-negative (or false-positive) results. 


Example 15. This example is due to Bohr and Birkedal [5] and includes a 
non-synchronised divergence. 
M = fun f -> 
      ref l1 = false in ref l2 = false in 
      f (fun () -> if !l1 then _bot_ else l2 := true); 
      if !l2 then _bot_ else l1 := true 


N = fun f -> f (fun () -> _bot_) 


Note that _bot_ is a diverging computation. This is a hard example to prove using 
environmental bisimulation even with up to techniques, requiring quantification 
over contexts within the proof. However, with up-to separation after the opponent 
applies the initial functions, the $\Gamma$ environments are emptied, thus leaving only 
one application of M and N that needs to be explored by the bisimulation. 
Applications of the inner function provided as argument to f only lead to a small 
number of reachable configurations. HOBBIT can indeed prove this equivalence. 


Up to Proponent Function Re-entry The higher-order nature of $\lambda^{\mathsf{imp}}$ and 
its LTS allows infinite nesting of opponent and proponent calls. Although up 
to separation avoids those in a number of examples, here we present a second 
novel up-to technique, which we call up to proponent function re-entry (or simply, 
up to re-entry). This technique has connections to the induction hypothesis in 
the definition of environmental bisimulations in [16]. However up to re-entry 
is specifically aimed at avoiding nested calls to proponent functions, and it is 
designed to work with our symbolic LTS. In combination with other techniques 
this eliminates the need to consider configurations with unbounded stacks K in 
many classical equivalences, including those in [21]. 
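\[
\textsc{UpToReentry}\quad
\dfrac{
\begin{array}{l}
C_1 = \langle A ; \Gamma_1 ; K_1 ; s_1 ; \cdot\rangle \mathrel{\mathcal{R}} \langle A ; \Gamma_2 ; K_2 ; s_2 ; \cdot\rangle = C_2 \\[2pt]
\forall t, C, A', \Gamma_1', \Gamma_2', s_1', s_2'.\ \big[\big(\mathsf{app}(i, \_) \notin t \text{ and} \\
\qquad \langle A ; \Gamma_1 ; \cdot ; s_1 ; \cdot\rangle \xrightarrow{\mathsf{app}(i, C)} \xRightarrow{\;t\;} \langle A' ; \Gamma_1' ; \cdot ; s_1' ; \cdot\rangle \text{ and} \\
\qquad \langle A ; \Gamma_2 ; \cdot ; s_2 ; \cdot\rangle \xrightarrow{\mathsf{app}(i, C)} \xRightarrow{\;t\;} \langle A' ; \Gamma_2' ; \cdot ; s_2' ; \cdot\rangle\big) \\
\qquad \text{implies } \Gamma_1' = \Gamma_1 \text{ and } \Gamma_2' = \Gamma_2 \text{ and } s_1 = s_1' \text{ and } s_2 = s_2'\big] \\[2pt]
C_1 \xrightarrow{\mathsf{app}(i, C)} \xRightarrow{\;t\;} \xrightarrow{\mathsf{app}(i, C')} \langle A ; \Gamma_1 ; K_1', K_1 ; s_1 ; e_1\rangle \\[2pt]
C_2 \xrightarrow{\mathsf{app}(i, C)} \xRightarrow{\;t\;} \xrightarrow{\mathsf{app}(i, C')} \langle A ; \Gamma_2 ; K_2', K_2 ; s_2 ; e_2\rangle
\end{array}
}{
\langle A ; \Gamma_1 ; K_1', K_1 ; s_1 ; e_1\rangle \mathrel{\mathsf{reent}(\mathcal{R})} \langle A ; \Gamma_2 ; K_2', K_2 ; s_2 ; e_2\rangle
}
\]


Fig. 4. Up to Proponent Function Re-entry (omitting rules for $\bot$-configurations). 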


Up to re-entry is realised by function reent in Fig. 4. The intuition of this up-to 
technique is that if the application of related functions at $i$ in the $\Gamma$ environments 
has no potential to change the local stores (up to garbage collection, encoded by 
(=)) or increase the $\Gamma$ environments, then there are no additional observations to 
be made by nested calls to the $i$-functions, thus configurations reached by such 
nested calls are added to the relation by this up-to technique. Soundness follows 
similarly to up-to separation. 

In HOBBIT we require the user to flag the functions to be considered for the 
up to re-entry technique. This annotation is later combined with state invariant 
annotations, as they are often used together. Inequivalences found while using 
the up to re-entry and state invariant annotations could be false-negatives due 
to incorrect user annotations. HOBBIT ensures that no such false-negatives are 
reported by re-running discovered inequivalences with these two techniques off. 

Below is an example where the state invariant needed is trivial and up to 
separation together with up to re-entry are sufficient to prove the equivalence. 


Example 16. 
M = ref x = 0 in fun f -> f (); !x      N = fun f -> f (); 0 


This is like Ex. 2 except the reference in M is created outside of the function 
body. The LTS for this is as follows. Labels (e;!x1) are continuations. 


[LTS diagram for M; graphic not recoverable from extraction.] 


Again, the opponent is allowed to reenter g as before. With up-to reentry, however, 
the opponent skips nested calls to g as these do not modify the state. 
[LTS diagram for M pruned with up to re-entry; graphic not recoverable from extraction.] 


N mirrors the above LTS without the x1 reference and with continuation (e; 0). 


6 Symbolic First-Order Transitions 


We extend $\lambda^{\mathsf{imp}}$ constants (Const) with a countable set of symbolic constants 
ranged over by $\kappa$. We define symbolic environments $\sigma ::= \cdot \mid (\kappa \bowtie e), \sigma$, where $\bowtie$ 
is either $=$ or $\neq$, and $e$ is an arithmetic expression over constants, and interpret 
them as conjunctions of (in-)equalities, with the empty set interpreted as $\top$. 


Definition 17 (Satisfiability). Symbolic environment $\sigma$ is satisfiable if there 
exists an assignment $\delta$, mapping the symbolic constants of $\sigma$ to actual constants, 
such that $\delta\sigma$ is a tautology; we then write $\delta \vDash \sigma$. 
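
Satisfiability of a symbolic environment can be illustrated with a brute-force search in Python. This is a toy stand-in for an SMT solver and makes strong assumptions: symbolic constants are integers drawn from a small hypothetical `domain`, each (in-)equality is given as a Python predicate over an assignment, and the name `find_assignment` is illustrative. A result of `None` therefore only means that no assignment exists within the chosen domain.

```python
from itertools import product

def find_assignment(names, constraints, domain=range(-3, 4)):
    """Search for an assignment of `names` that satisfies every constraint."""
    for values in product(domain, repeat=len(names)):
        delta = dict(zip(names, values))
        if all(holds(delta) for holds in constraints):
            return delta           # delta makes the whole conjunction true
    return None                    # nothing found in this finite domain

# sigma = (k1 = k2 + 1), (k1 != 0)
sigma = [lambda d: d["k1"] == d["k2"] + 1,
         lambda d: d["k1"] != 0]

# An environment demanding both k1 = 1 and k1 != 1 has no assignment.
contradiction = [lambda d: d["k1"] == 1,
                 lambda d: d["k1"] != 1]

print(find_assignment(["k1", "k2"], sigma) is not None)
print(find_assignment(["k1"], contradiction))
```

The empty environment is satisfied by the empty assignment, matching its interpretation as the true formula.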


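We extend reduction configurations with a symbolic environment $\sigma$, written as 
$\sigma \vdash \langle s ; e\rangle$. These constants are implicitly annotated with their type. We modify 
the reduction semantics from Fig. 2 to consider symbolic constants: 

\[
\begin{array}{rcll}
\sigma \vdash \langle s ; \mathit{op}(\vec{c})\rangle & \hookrightarrow & \sigma \wedge (\kappa = \mathit{op}(\vec{c})) \vdash \langle s ; \kappa\rangle & \text{if } \kappa \text{ fresh} \\
\sigma \vdash \langle s ; \mathsf{if}\ \kappa\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2\rangle & \hookrightarrow & \sigma \wedge (\kappa = \mathsf{tt}) \vdash \langle s ; e_1\rangle & \text{if } \sigma \wedge (\kappa = \mathsf{tt}) \text{ is sat.} \\
\sigma \vdash \langle s ; \mathsf{if}\ \kappa\ \mathsf{then}\ e_1\ \mathsf{else}\ e_2\rangle & \hookrightarrow & \sigma \wedge (\kappa = \mathsf{ff}) \vdash \langle s ; e_2\rangle & \text{if } \sigma \wedge (\kappa = \mathsf{ff}) \text{ is sat.}
\end{array}
\]

All other reduction semantics rules carry the $\sigma$. The LTS from Sec. 4 is modified 
to operate over configurations of the form $\sigma \vdash C$ or $\cdot \vdash \langle\bot\rangle$. We let $\mathbb{C}$ range over 
both forms of configurations. All LTS rules for proponent transitions simply carry 
the $\sigma$; rule TAU may increase $\sigma$ due to the inner reduction. Opponent transitions 
generate fresh symbolic constants, instead of actual constants: labels $\mathsf{app}(i, D[\vec{\alpha}])$ 
and $\mathsf{ret}(D[\vec{\alpha}])$ in rules OPAPP and OPRET of Fig. 3, respectively, contain $D$ with 
symbolic, instead of concrete constants. We adapt (bi-)simulation as follows. 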


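Definition 18. Binary relation $\mathcal{R}$ on symbolic configurations is a weak 
simulation when for all $\mathbb{C}_1 \mathrel{\mathcal{R}} \mathbb{C}_2$ and $\mathbb{C}_1 \xrightarrow{\eta_1} \mathbb{C}_1'$, there exists $\mathbb{C}_2'$ such that 
$\mathbb{C}_2 \xRightarrow{\eta_2} \mathbb{C}_2'$ and 

\[
\mathbb{C}_1' \mathrel{\mathcal{R}} \mathbb{C}_2' \qquad (\mathbb{C}_1'.\sigma, \mathbb{C}_2'.\sigma) \text{ is sat.} \qquad \forall \delta.\ \delta \vDash (\mathbb{C}_1'.\sigma, \mathbb{C}_2'.\sigma) \implies \delta\eta_1 = \delta\eta_2
\]


Lemma 19. $(\sigma_1 \vdash C_1) \lessapprox (\sigma_2 \vdash C_2)$ iff for all $\delta \vDash \sigma_1, \sigma_2$ we have $\delta C_1 \lessapprox \delta C_2$. 


Corollary 20 (Soundness, Completeness). $(\cdot \vdash C_1) \lessapprox (\cdot \vdash C_2)$ iff $C_1 \lessapprox C_2$. 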


The up-to techniques we have developed in previous sections apply unmodified to 
the extended LTS as the techniques do not involve symbolic constants, with the 
exception of up to beta which requires adapting the definition of a beta move to 
consider all possible $\delta$. The introduction of symbolic first-order transitions allows 
us to prove many interesting first-order examples, such as the equivalence of 
bubble sort and insertion sort, an example borrowed from HECTOR [12] (omitted 
here, see the HOBBIT distribution). Below is a simpler example showing the 
equivalence of two integer swap functions which, by leveraging Z3 [23], HOBBIT 
is able to prove. 


Example 21.

M = let swap xy =
      let (x,y) = xy
      in (y, x)
    in swap

N = fun xy -> let (x,y) = xy in
      ref x = x in ref y = y in
      x := !x - !y; y := !x + !y;
      x := !y - !x; (!x, !y)
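The arithmetic in N can be checked by hand: after the three assignments, x holds the original y and y holds the original x. The following Python sketch is only an illustration of that reasoning on sample inputs (it is not how HOBBIT proves the equivalence, which relies on symbolic constants and Z3); the function names are hypothetical.

```python
def swap_tuple(xy):
    """Counterpart of M: swap the components directly."""
    x, y = xy
    return (y, x)


def swap_arith(xy):
    """Counterpart of N: swap using three arithmetic updates on local state."""
    x, y = xy
    x = x - y      # x holds x0 - y0
    y = x + y      # y holds (x0 - y0) + y0 = x0
    x = y - x      # x holds x0 - (x0 - y0) = y0
    return (x, y)


def agree_on(samples):
    """Check that both versions return the same pair on every sample."""
    return all(swap_tuple(p) == swap_arith(p) for p in samples)


samples = [(a, b) for a in range(-5, 6) for b in range(-5, 6)]
print(agree_on(samples))
```

Testing on samples only gives evidence; the symbolic execution described above covers all integers at once.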


7 Up to State Invariants 


The addition of symbolic constants into $\lambda^{\mathsf{imp}}$ and the LTS not only allows us to
consider all possible opponent-generated constants simultaneously in a symbolic
execution of proponent expressions, but also allows us to define an additional
powerful up-to technique: up to state invariants. We define this technique in two
parts: up to abstraction and up to tautology, realised by abs and taut.⁵


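UpToAbs:

$$\frac{(\sigma_1 \vdash C_1) \mathrel{\mathcal{R}} (\sigma_2 \vdash C_2)}{(\sigma_1 \vdash C_1)[\vec{c}/\vec{\kappa}] \mathrel{\mathsf{abs}(\mathcal{R})} (\sigma_2 \vdash C_2)[\vec{c}/\vec{\kappa}]}$$

UpToTaut:

$$\frac{(\sigma_1, \sigma_1' \vdash C_1) \mathrel{\mathcal{R}} (\sigma_2, \sigma_2' \vdash C_2) \qquad \sigma_1, \sigma_2, \sigma_1', \sigma_2' \text{ is sat.} \qquad \sigma_1, \sigma_2 \wedge \neg(\sigma_1', \sigma_2') \text{ is not sat.}}{(\sigma_1 \vdash C_1) \mathrel{\mathsf{taut}(\mathcal{R})} (\sigma_2 \vdash C_2)}$$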


The first function abs allows us to derive the equivalence of configurations by 
abstracting constants with fresh symbolic constants (of the same type) and 
instead prove equivalent the more abstract configurations. The second function 
taut allows us to introduce tautologies into the symbolic environments. These 
are predicates which are valid; i.e., they hold for all instantiations of the abstract 
variables. Combining the two functions we can introduce a tautology $I(\vec{c})$ into
the symbolic environments, and then abstract constants $\vec{c}$ from the predicate but
also from the configurations with symbolic ones, obtaining $I(\vec{\kappa})$, which encodes
an invariant that always holds.

Currently in HOBBIT, up to abstraction and tautology are combined and 
applied in a principled way. Functions can be annotated with the following syntax: 


F = fun x {$\vec{\kappa}$ | $l_1$ as $C_1[\vec{\kappa}]$, ..., $l_n$ as $C_n[\vec{\kappa}]$ | $\phi$} -> e


The annotation instructs HOBBIT to use the two techniques when opponent
applies related functions where at least one of them has such an annotation. If
both functions contain annotations, then they are combined and the same $\vec{\kappa}$ are
used in both annotations. The techniques are used again when proponent returns
from the functions, and proponent calls opponent from within the functions.⁶ As
discussed in Sec. 5, the same annotation enables up to reentry in HOBBIT.
When HOBBIT uses the above two up-to techniques it 1) pattern-matches
the values currently in each location $l_i$ with the value context $C_i$ where fresh
symbolic constants $\vec{\kappa}$ are in its holes, obtaining a substitution $[\vec{c}/\vec{\kappa}]$; 2) the up to
tautology technique is applied for the formula $\phi[\vec{c}/\vec{\kappa}]$; and 3) the up to abstraction
technique is applied by replacing $\phi[\vec{c}/\vec{\kappa}]$ in the symbolic environment with $\phi$, and
the contents of locations $l_i$ with $C_i[\vec{\kappa}]$.

⁵ HOBBIT also implements an up to $\sigma$-normalisation and garbage collection technique.
⁶ Finer-grain control of application of these up-to techniques is left to future work.

Example 22. Following is an example by Meyer and Sieber [21] featuring location
passing, adapted to $\lambda^{\mathsf{imp}}$ where locations are local.

M = let loc_eq loc1loc2 = [...] in
    fun q -> ref x = 0 in
      let locx = (fun () -> !x) , (fun v -> x := v) in
      let almostadd_2 locz {w | x as w | w mod 2 == 0} =
        if loc_eq (locx,locz) then x := 1 else x := !x + 2
      in q almostadd_2; if !x mod 2 = 0 then _bot_ else ()

N = fun q -> _bot_


In this example we simulate general references as a pair of read-write functions.
Function loc_eq implements a standard location equality test. The two higher-
order expressions are equivalent because the opponent can only increase the
contents of x through the function almostadd_2. As the number of times the
opponent can call this function is unbounded, the LTS is infinite. However, the
annotation of function almostadd_2 applies the up to state invariants technique
when the function is called (and, less crucially, when it returns), replacing the
concrete value of x with a symbolic integer constant w satisfying the invari-
ant w mod 2 == 0. This makes the LTS finite, up to permutations of symbolic
constants. Moreover, up to separation removes the outer functions from the $\Gamma$
environments, thus preventing re-entrant calls to these functions. Note that the up-to
techniques are applied even though one of the configurations is diverging (_bot_).
This would not be possible with the LTS and bisimulation of [3].
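The invariant argument can be illustrated with a small Python sketch. It assumes, as the text argues, that the opponent can never supply the private location pair locx, so only the else branch of almostadd_2 is reachable; the function names and the `same_location` flag are hypothetical stand-ins, and the check runs over sample values rather than symbolically.

```python
def almostadd_2_step(x, same_location):
    """New contents of x after one call; same_location models loc_eq (locx, locz)."""
    return 1 if same_location else x + 2


def invariant(w):
    """The annotated state invariant: w mod 2 == 0."""
    return w % 2 == 0


def invariant_preserved(samples):
    """Every even starting value stays even after a call with a foreign location."""
    return all(
        invariant(almostadd_2_step(w, False)) for w in samples if invariant(w)
    )


print(invariant_preserved(range(-100, 101)))
```

Because the invariant is preserved by every reachable call, one symbolic state with `w mod 2 == 0` stands for all the concrete states 0, 2, 4, and so on, which is what makes the explored LTS finite.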


8 Implementation and Evaluation 


We implemented the LTS and up-to techniques for $\lambda^{\mathsf{imp}}$ in a tool prototype called
HOBBIT, which we ran on a test-suite of 105 equivalences and 68 inequivalences— 
3338 and 2263 lines of code for equivalences and inequivalences respectively. 
HOBBIT is bounded in the total number of function calls it explores per path. 
We ran HOBBIT with a default bound of 6 calls except where a larger bound was 
found to prove or disprove equivalence—46 examples required a larger bound, 
and the largest bound used was 348. To illustrate the impact of up-to techniques, 
we checked all files (pairs of expressions to be checked for equivalence) in five 
configurations: default (all up-to techniques on), up to separation off, annotations 
(up to state invariants and re-entry) off, up to re-entry off, and everything off. 
The tool stops at the first trace that disproves equivalence, after enumerating 
all traces up to the bound, or after timing out at 150 seconds. Time taken and 
exit status (equivalent, inequivalent, inconclusive) were recorded for each file; an 
overview of the experiment can be seen in the following table. All experiments 
ran on an Ubuntu 18.04 machine with 32GB RAM, Intel Core i7 1.90GHz CPU, 
with intermediate calls to Z3 4.8.10 to prune invalid internal symbolic branching
and decide symbolic bisimulation conditions. All constraints passed to Z3 are of
propositional satisfiability in conjunctive normal form (CNF).


        default          sep. off           annot. off        ree. off          all off
eq.     72 | 0 [5.6s]    32 | 0 [1622.9s]   47 | 0 [178.3s]   57 | 0 [177.6s]   3 | 0 [2098.5s]
ineq.   0 | 68 [20.0s]   0 | 66 [312.8s]    0 | 68 [19.6s]    0 | 68 [20.1s]    0 | 65 [515.7s]

a | b [c] for a (out of 105) equivalences and b (out of 68) inequivalences reported, taking c seconds in total.


We can observe that HOBBIT was sound and bounded-complete for our 
examples; no false reports and all inequivalences were identified. Up-to techniques 
also had a significant impact on proving equivalence. With all techniques on, it 
proved 68.6% of our equivalences; a dramatic improvement over 2.9% proven 
with none on. The most significant technique was up-to separation—necessary 
for 55.6% of equivalences proven and reducing time taken by 99.99%—which was 
useful when functions could be independently explored by the context. Following 
was annotations—necessary for 34.7% of equivalences and decreasing time by 
96.9% —and up-to re-entry —20.8% of files and decreased time by 96.8%. Although 
the latter two required manual annotation, they enabled equivalences where our 
language was able to capture the proof conditions. Note that, since turning off 
invariant annotations also turns off re-entry, only 10 files needed up-to re-entry on 
top of invariant annotations. In contrast, inequivalences did not benefit as much. 
This was expected as without up-to techniques HOBBIT is still based on bounded 
model checking, which is theoretically sound and complete for inequivalences, and 
finds the shortest counterexample traces using breadth-first search. Nonetheless, 
with up-to techniques turned off, inequivalences were discovered in 515.7s (vs. 20s 
with techniques on) and three files timed out, due to the techniques reducing the 
size and branching factor of configurations. This suggests that the reduction in 
state space is still relevant when searching for counterexamples. 
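The proportions quoted above follow directly from the counts in the table. The short Python sketch below recomputes them; the dictionary keys are hypothetical labels for the five configurations, and the counts are the numbers of equivalences proven in each configuration.

```python
# Equivalences proven per configuration, as reported in the table above.
proved = {
    "default": 72,
    "sep_off": 32,
    "annot_off": 47,
    "reentry_off": 57,
    "all_off": 3,
}
total_equivalences = 105


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


def needed(config):
    """Share of the default-proven equivalences lost when `config` is used."""
    return pct(proved["default"] - proved[config], proved["default"])


print(pct(proved["default"], total_equivalences))   # share proven with all techniques on
print(pct(proved["all_off"], total_equivalences))   # share proven with none on
print(needed("sep_off"), needed("annot_off"), needed("reentry_off"))
print(proved["reentry_off"] - proved["annot_off"])  # files needing re-entry on top of invariants
```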


9 Comparison with Existing Tools 


There are two main classes of tools for contextual equivalence checking. The first 
one includes semantics-driven tools that tackle higher-order languages with state 
like ours. In this class belong game-based tools HECTOR [12] and CONEQCT [24],
which can only address carefully crafted fragments of the language, delineated by 
type restrictions and bounded data types. The most advanced tool in this class 
is SYTECI [14], which is based on logical relations and removes a good part of 
the language restrictions needed in the previous tools. The second class concerns 
tools that focus on first-order languages, typically variants of C, with main tools 
including REVE [9], SYMDIFF [17] and RVT [11]. These are highly optimised
for handling internal loops, a problem orthogonal to handling the interactions 
between higher-order functions and their environment, addressed by HOBBIT and 
related tools. We believe the techniques used in these tools may be useful when 
adapted to HOBBIT, which we leave for future work. 

In the higher-order contextual equivalence setting, the most relevant tool to
compare with HOBBIT is SYTECI. This is because SYTECI supersedes previous
tools by proving examples with fewer syntactical limitations. We ran the tools on
examples from both SYTECI's and our own benchmarks (7 and 15 equivalences,
and 2 and 7 inequivalences from SYTECI and HOBBIT respectively) with a
timeout of 150s and using Z3. Unfortunately, due to differences in parsing
and SYTECI's syntactical restrictions, the input languages were not entirely
compatible and only a few manually translated programs were chosen.


                        SYTECI               HOBBIT
SYTECI eq. examples     3 | 0 | 4 (0.03s)    1 | 0 | 6 (<0.01s)
HOBBIT eq. examples     8 | 0 | 7 (0.4s)     15 | 0 | 0 (<0.01s)
SYTECI ineq. examples   0 | 2 | 0 (0.06s)    0 | 2 | 0 (0.02s)
HOBBIT ineq. examples   2 | 3 | 2 (0.52s)    0 | 7 | 0 (0.45s)

a | b | c (d) for a equivalences, b inequivalences and c inconclusive results reported, taking d seconds in total.


We were unable to translate many of our examples because of restrictions
in the input syntax supported by SYTECI. Some of these restrictions were
inessential (e.g. absence of tuples) while others were substantial: the tool does not
support programs where references are allocated both inside and outside functions
(e.g. Ex. 15), or with non-synchronisable recursive calls. Moreover, SYTECI relies
on Constrained Horn Clause satisfiability which is undecidable. In our testing
SYTECI sometimes timed out on examples; in private correspondence with its
creator this was attributed to Z3's ability to solve Constrained Horn Clauses.
Finally, SYTECI was sound for equivalences, but not always for inequivalences as
can be seen in the table above; the reason is unclear and may be due to bugs. On
the other hand, SYTECI was able to solve equivalences we are not able to handle;
e.g. synchronisable recursive calls and examples with well-bracketing properties.


10 Conclusion 


Our experience with HOBBIT suggests that our technique provides a significant 
contribution to verification of contextual equivalence. In the higher-order case, 
HOBBIT does not impose language restrictions as present in other tools. Our
tool is able to solve several examples that cannot be solved by SYTECI, which
is the most advanced tool in this family. In the first-order case, the problem of
contextual equivalence differs significantly as the interactions that a first-order 
expression can have with its context are limited; e.g. equivalence analyses do not 
need to consider callbacks or re-entrant calls. Moreover, the distinction between 
global and local state is only meaningful in higher-order languages where a 
program phrase can invoke different calls of the same function, each with its own 
state. Therefore, tools for first-order languages focus on what in our setting are 
internal transitions and the complexities arising from e.g. unbounded datatypes 
and recursion, whereas we focus on external interactions with the context. 

As for limitations, HOBBIT does not handle synchronised internal recursion
and well-bracketed state, which SYTECI can often solve. More generally, HOBBIT
is not optimised for internal recursion as first-order tools are. In this work we
have also disallowed reference types in $\lambda^{\mathsf{imp}}$ to simplify the technical development;
location exchange is encoded via function exchange (cf. Ex. 22). We intend to
address these limitations in future work and explore applications of HOBBIT to
real-world examples.


Abstract. Motivated by proof checking, we consider the problem of efficiently
establishing equivalence of propositional formulas by relaxing the completeness
requirements while still providing certain guarantees. We present a quasilinear
time algorithm to decide the word problem on natural algebraic structures we call
orthocomplemented bisemilattices, a subtheory of Boolean algebra. The starting
point for our procedure is a variation of the Aho, Hopcroft, and Ullman algorithm for
isomorphism of trees, which we generalize to directed acyclic graphs. We combine
this algorithm with a term rewriting system we introduce to decide equivalence of
terms. We prove that our rewriting system is terminating and confluent, implying
the existence of a normal form. We then show that our algorithm computes this
normal form in log-linear (and thus sub-quadratic) time. We provide pseudocode
and a minimal working implementation in Scala.


1 Introduction 


Reasoning about propositional logic and its extensions is a basis of many verification 
algorithms [19]. Propositional variables may correspond to, for example, sub-formulas 
in first-order logic theories of SMT solvers [2,5,26], hypotheses and lemmas inside proof 
assistants [13,27,32], or abstractions of sets of states. In particular, it is often of interest 
to establish that two propositional formulas are equivalent. The equivalence problem 
for propositional logic is coNP-complete as a negation of propositional satisfiability [8]. 
From a proof complexity point of view [18], many known proof systems, including (non-
extended) resolution [31] and cutting planes [29], have exponential-sized shortest proofs
for certain propositional formulas. SAT and SMT solvers rely on DPLL-style algorithms
[9,10] and do not have polynomial run-time guarantees on equivalence checking, even if
formulas are syntactically close. Proof assistants implement such algorithms as tactics,
so they have similar difficulties. A consequence of this is that implemented systems may
take a very long time to acknowledge (or fail to acknowledge) that a large formula is
equivalent to its minor variant differing in, for example, reordering of internal conjuncts
or disjuncts.
Similar situations also arise in program verifiers [12,21,30,34,35], where assertions act 
as lemmas in a proof. 


* We acknowledge the financial support of the Swiss National Science Foundation project 
200021_197288 “A Foundational Verifier”.


© The Author(s) 2022 
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 196-214, 2022. 
https://doi.org/10.1007/978-3-030-99527-0_11


Equivalence Checking for Orthocomplemented Bisemilattices in Log-Linear Time


It is thus natural to ask for an approximation of the propositional equivalence prob- 
lem: can we find an expressive theory supporting many of the algebraic laws of Boolean 
algebra but for which we can still have a complete and efficient algorithm for formula 
equivalence? By efficient, we mean about as fast, up to logarithmic factors, as the simple 
linear-time syntactic comparison of formula trees. 

We can use such an efficient equivalence algorithm to construct more flexible proof 
systems. Consider any sound proof system for propositional logic and replace the notion 
of identical sub-formulas with our notion of fast equivalence. For example, the axiom 
schema $p \to (q \to p)$ becomes $p \to (q \to p')$ for all equivalent $p$ and $p'$. The new system
remains sound. It accepts all the previously admissible inference steps, but also some 
new ones, which makes it more flexible. 


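L1: $x \sqcup y = y \sqcup x$                              L1': $x \wedge y = y \wedge x$
L2: $x \sqcup (y \sqcup z) = (x \sqcup y) \sqcup z$        L2': $x \wedge (y \wedge z) = (x \wedge y) \wedge z$
L3: $x \sqcup x = x$                                       L3': $x \wedge x = x$
L4: $x \sqcup 1 = 1$                                       L4': $x \wedge 0 = 0$
L5: $x \sqcup 0 = x$                                       L5': $x \wedge 1 = x$
L6: $\neg\neg x = x$                                       L6': same as L6
L7: $x \sqcup \neg x = 1$                                  L7': $x \wedge \neg x = 0$
L8: $\neg(x \sqcup y) = \neg x \wedge \neg y$              L8': $\neg(x \wedge y) = \neg x \sqcup \neg y$


Table 1. Laws of an algebraic structure $(S, \wedge, \sqcup, 0, 1, \neg)$. Our algorithm is complete (and log-
linear time) for structures that satisfy laws L1-L8 and L1'-L8'. We call these structures orthocom-
plemented bisemilattices (OCBSL).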


L9: $x \sqcup (x \wedge y) = x$                                        L9': $x \wedge (x \sqcup y) = x$
L10: $x \sqcup (y \wedge z) = (x \sqcup y) \wedge (x \sqcup z)$        L10': $x \wedge (y \sqcup z) = (x \wedge y) \sqcup (x \wedge z)$


Table 2. Neither the absorption law L9,L9' nor distributivity L10,L10' hold in OCBSL. Without
L9,L9', the operations $\wedge$ and $\sqcup$ induce different partial orders. If an OCBSL satisfies L10,L10'
then it also satisfies L9,L9' and is precisely a Boolean algebra.


1.1 Problem Statement 


This paper proposes to approximate propositional formula equivalence using a new al-
gorithm that solves exactly the word problem for structures we call orthocomplemented
bisemilattices (axiomatized in Table 1), in only log-linear time. In general, the word
problem for an algebraic theory with signature $S$ and axioms $A$ is the problem of de-
termining, given two terms $t_1$ and $t_2$ in the language of $S$ with free variables, whether
$t_1 = t_2$ is a consequence of the axioms. Our main interest in the problem is that ortho-
complemented bisemilattices (OCBSL) are a generalisation of Boolean algebra. This
structure satisfies a weaker set of axioms that omits the distributivity law as well as its
weaker variant, the absorption law (Table 2). Hence, this problem is a relaxation "up
to distributivity" of the propositional formula equivalence. A positive answer implies
formulas are equivalent in all Boolean algebras, hence also in propositional logic.

Definition 1 (Word Problem for Orthocomplemented Bisemilattices). Consider the
signature with two binary operations $\wedge$, $\sqcup$, unary operation $\neg$ and constants 0, 1. The
OCBSL-word problem is the problem of determining, given two terms $t_1$ and $t_2$ in this
signature, containing free variables, whether $t_1 = t_2$ is a consequence (in the sense
of first-order logic with equality) of the universally quantified axioms L1-L8, L1'-L8' in
Table 1.
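A positive answer to the OCBSL-word problem transfers to propositional logic because every Boolean algebra satisfies the laws of Table 1. As a sanity check of that claim on the smallest case, the following Python sketch verifies L1-L8 and L1'-L8' by brute force on the two-element Boolean algebra {0, 1}. The encoding (bitwise `|` for $\sqcup$, `&` for $\wedge$, `1 - x` for $\neg$) and all names are illustrative assumptions, not part of the paper's implementation.

```python
from itertools import product

B = (0, 1)


def join(x, y):
    return x | y


def meet(x, y):
    return x & y


def neg(x):
    return 1 - x


# Each law is a predicate over its free variables, paired with its arity.
LAWS = {
    "L1": (lambda x, y: join(x, y) == join(y, x), 2),
    "L1'": (lambda x, y: meet(x, y) == meet(y, x), 2),
    "L2": (lambda x, y, z: join(x, join(y, z)) == join(join(x, y), z), 3),
    "L2'": (lambda x, y, z: meet(x, meet(y, z)) == meet(meet(x, y), z), 3),
    "L3": (lambda x: join(x, x) == x, 1),
    "L3'": (lambda x: meet(x, x) == x, 1),
    "L4": (lambda x: join(x, 1) == 1, 1),
    "L4'": (lambda x: meet(x, 0) == 0, 1),
    "L5": (lambda x: join(x, 0) == x, 1),
    "L5'": (lambda x: meet(x, 1) == x, 1),
    "L6": (lambda x: neg(neg(x)) == x, 1),
    "L7": (lambda x: join(x, neg(x)) == 1, 1),
    "L7'": (lambda x: meet(x, neg(x)) == 0, 1),
    "L8": (lambda x, y: neg(join(x, y)) == meet(neg(x), neg(y)), 2),
    "L8'": (lambda x, y: neg(meet(x, y)) == join(neg(x), neg(y)), 2),
}


def holds(name):
    """Check one law on every assignment of its variables over {0, 1}."""
    law, arity = LAWS[name]
    return all(law(*args) for args in product(B, repeat=arity))


failed = [name for name in LAWS if not holds(name)]
print(failed)
```

The converse direction does not hold: an OCBSL need not satisfy absorption or distributivity (Table 2), which is why the procedure is incomplete for propositional equivalence.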


Contribution. We present an $O(n \log^2(n))$ algorithm for the word problem of orthocom-
plemented bisemilattices. In the process, we introduce a confluent and terminating rewriting
system for OCBSL on terms modulo commutativity. We analyze the algorithm to show
its correctness and complexity. We present its executable description and a Scala imple-
mentation at https://github.com/epfl-lara/OCBSL.


1.2 Related Work 


The word problem on lattices has been studied in the past. The structure we consider
is, in general, not a lattice. Whitman [33] showed decidability of the word problem on 
free lattices, essentially by showing that the natural order relation on lattices between two 
words can be decided by an exhaustive search. The word problem on orthocomplemented 
lattices has been solved typically by defining a suitable sequent calculus for the order 
relation with a cut rule for transitivity [4, 17]. Because a cut elimination theorem can be 
proved similarly to the original from Gentzen [11], the proof space is finite and a proof 
search procedure can decide validity of the implication in the logic, which translates to 
the original word problem. 

The word problem for free lattices was shown to be in PTIME by Hunt et al. [15] 
and the word problem for orthocomplemented lattices was shown to be in PTIME by 
Meinander [25]. Those algorithms essentially rely on similar proof-search methods as 
the previous ones, but bound the search space. These results make no mention of a spe-
cific degree of the polynomial; our analysis suggests that, as described, these algorithms
run in O(n*). Related techniques of locality have been applied more broadly and also
yield polynomial bounds, with the specific exponents depending on local Horn clauses 
that axiomatize the theory [3, 24]. 

Aside from the use in equivalence checking, the problem is additionally of indepen- 
dent interest because OCBSL are a natural weakening of Boolean Algebra and ortho- 
complemented lattices. They are dual to complemented lattices in the sense illustrated 
by Figure 1. A slight weakening of OCBSL, called de Morgan bisemilattice, has been 
used to simulate electronic circuits [6,22]. OCBSL may be applicable in this scenario 
as well. Moreover, our algorithm can also be adapted to decide, in log-linear time, the 
word problem for this weaker theory. 

To the best of our knowledge, no solution was presented in the past for the word 
problem for orthocomplemented bisemilattices (OCBSL). Moreover, we are not aware 
of previous log-linear algorithms for the related previously studied theories either. 


1.3 Overview of the Algorithm 


It is common to represent a term, like a Boolean formula, as an abstract syntax tree.
In such a tree, a node corresponds to either a function symbol, a constant symbol or a
variable, and the children of a function node represent the arguments of the function. In
general, for a function symbol $f$, trees $f(x, y)$ and $f(y, x)$ are distinct; the children of a
node are stored in a specific order. Commutativity of a function symbol $f$ corresponds to
the fact that children of a node labelled by $f$ are instead unordered. Our algorithm thus
uses as its starting point a variation of the algorithm of Aho, Hopcroft, and Ullman [14]
for tree isomorphism, as it corresponds to deciding equality of two terms modulo com-
mutativity. However, the theory we consider contains many more axioms than merely
commutativity. Our approach is to find an equivalent set of reduction rules, themselves
understood modulo commutativity, that is suitable to compute a normal form of a given
formula with respect to those axioms using the ideas of term rewriting [1]. The interest
of tree isomorphism in our approach is two-fold: first, it helps to find application cases
of our reduction rules, and second, it compares the two terms of our word problem. In
the final algorithm, both aspects are realized simultaneously.


[Figure: (a) Complemented lattice, (b) Orthocomplemented bisemilattice, (c) Orthocomplemented lattice]


Fig. 1. Bisemilattices satisfying absorption or de Morgan laws.


2 Preliminaries 


2.1 Lattices and Bisemilattices 


To define and situate our problem, we present a collection of algebraic structures satis-
fying certain subsets of the laws in Tables 1 and 2.

A structure $(S, \wedge)$ that is commutative (L1), associative (L2) and idempotent (L3) is
a semilattice. A semilattice induces a partial order relation on $S$ defined by $a \le b \iff
(a \wedge b) = a$. Indeed, one can verify that $\exists c.\ (b \wedge c) = a \iff (b \wedge a) = a$, from which tran-
sitivity follows. Antisymmetry is immediate. In such a partially ordered set (poset) $S$, two
elements $a$ and $b$ always have a greatest lower bound, or glb, $a \wedge b$. Conversely, a poset
such that any two elements have a glb is always a semilattice. A structure $(S, \wedge, 0, 1)$ that
satisfies L1, L2, L3, L4, and L5 is a bounded upper-semilattice. Equivalently, 1 is the
maximum element and 0 the minimum element in the corresponding poset. Similarly,
a structure $(S, \sqcup, 0, 1)$ that satisfies L1' to L5' is a bounded lower-semilattice. In that
case, we write the corresponding ordering relation $\sqsubseteq$. Note that it points in the direc-
tion opposite to $\le$, so that 1 is always the "maximum" element and 0 the "minimum"
element. A structure $(S, \wedge, \sqcup)$ is a bisemilattice if $(S, \wedge)$ is an upper semilattice and
$(S, \sqcup)$ a lower semilattice. There are in general no specific laws relating the two semi-
lattices of a bisemilattice. They can be the same semilattice or completely different. If
the bisemilattice satisfies the absorption law (L9), then the two semilattices are related
in such a way that $a \le b \iff a \sqsubseteq b$, i.e. the two orders $\le$ and $\sqsubseteq$ are equal and the
structure is called a lattice. A bisemilattice is consistently bounded if both semilattices
are bounded and if $0_\wedge = 0_\sqcup = 0$ and $1_\wedge = 1_\sqcup = 1$, which will be the case in this
paper. A structure $(S, \wedge, \sqcup, \neg, 0, 1)$ that satisfies L1 to L7 and L1' to L7' is called a com-
plemented bisemilattice, with complement operation $\neg$. A complemented bisemilattice
satisfying de Morgan's Law (L8 and L8') is an orthocomplemented bisemilattice and
implies $\neg 0 = \neg(\neg 1 \wedge 0) = \neg\neg 1 \sqcup \neg 0 = 1$. A structure satisfying L1-L9 and L1'-L9' is an
orthocomplemented lattice. Both de Morgan laws (L8, L8') and absorption laws (L9
and L9') relate the two semilattices, in a way summarised in Figure 1. In bisemilattices,
orthocomplementation is (merely) equivalent to $a \le b \iff \neg b \sqsubseteq \neg a$. Indeed, we have:


$$a \le b \overset{\text{def}}{\iff} a \wedge b = a \overset{\text{L8'}}{\iff} \neg a \sqcup \neg b = \neg a \overset{\text{def}}{\iff} \neg b \sqsubseteq \neg a$$


In the presence of L1-L8, L1'-L8', the law of absorption (L9 and L9') is implied
by distributivity. In fact, an orthocomplemented bisemilattice with distributivity is a
lattice and even a Boolean algebra. In this sense, we can consider orthocomplemented
bisemilattices as "Boolean algebra without distributivity".


2.2 Term Rewriting Systems


We next review basics of term rewriting systems. For a more complete treatment, see [1]. 


Definition 2. A term rewriting system is a list of rewriting rules of the form $e_l = e_r$
with the meaning that the occurrence of $e_l$ in a term $t$ can be replaced by $e_r$. $e_l$ and $e_r$
can contain free variables. To apply the rule, $e_l$ is unified with a subterm of $t$, and that
subterm is replaced by $e_r$ with the same unifier. If applying a rewriting rule to $t_1$ yields
$t_2$, we say that $t_1$ reduces to $t_2$ and write $t_1 \to t_2$. We denote by $\overset{*}{\to}$ the reflexive transitive closure
of $\to$ and by $\overset{*}{\leftrightarrow}$ its reflexive transitive symmetric closure.


An axiomatic system such as L1-L9, L1'-L9' induces a term rewriting system, inter-
preting equalities from left to right. In that case $t_1 \overset{*}{\leftrightarrow} t_2$ coincides with the validity of
the equality $t_1 = t_2$ in the theory given by the axioms [1, Theorem 3.1.12].


Definition 3. A term rewriting system is terminating if there exists no infinite chain of
reducing terms $t_1 \to t_2 \to t_3 \to \dots$


Fact 1. If there is a well-founded order $<$ (or, in particular, a measure $m$) on terms such
that $t_1 \to t_2 \implies t_2 < t_1$ (or, in particular, $m(t_2) < m(t_1)$) then the term rewriting
system is terminating.

Definition 4. A term rewriting system is confluent iff for all $t, t_1, t_2$: $t \overset{*}{\to} t_1 \wedge t \overset{*}{\to} t_2$
implies $\exists t_3.\ t_1 \overset{*}{\to} t_3 \wedge t_2 \overset{*}{\to} t_3$.


Theorem 1 (Church-Rosser Property). [1, Chapter 2] A term rewriting system is
confluent if and only if $\forall t_1, t_2.\ (t_1 \overset{*}{\leftrightarrow} t_2) \implies (\exists t_3.\ t_1 \overset{*}{\to} t_3 \wedge t_2 \overset{*}{\to} t_3)$.


A terminating and confluent term rewriting system directly implies decidability of 
the word problem for the underlying structure, as it makes it possible to compute the 
normal form of two terms to check if they are equivalent. Note that commutativity is not 
a terminating rewriting rule, but similar results hold if we consider the set of all terms,
as well as rewrite rules, modulo commutativity [1, Chapter 11], [28]. To efficiently ma- 
nipulate terms modulo commutativity and achieve log-linear time, we will employ an 
algorithm for comparing trees with unordered children. 
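To make the idea of deciding equivalence through normal forms concrete, here is a minimal Python sketch. It is a toy system, not the rewriting system introduced in this paper: it orients only L6, L5, L5', L3 and L3' from left to right, and it does not work modulo commutativity. Terms are assumed to be nested tuples such as `("and", "x", 1)`, with strings as variables and the integers 0 and 1 as constants; every name is hypothetical. Each rule strictly shrinks the term, so term size is a measure in the sense of Fact 1.

```python
def normalize(t):
    """Bottom-up (innermost) rewriting with a toy subset of oriented laws."""
    if not isinstance(t, tuple):
        return t                      # variable or constant: already normal
    op, *args = t
    args = [normalize(a) for a in args]
    # L6: not(not(x)) -> x
    if op == "not" and isinstance(args[0], tuple) and args[0][0] == "not":
        return args[0][1]
    # L5': x and 1 -> x
    if op == "and" and args[1] == 1:
        return args[0]
    # L5: x or 0 -> x
    if op == "or" and args[1] == 0:
        return args[0]
    # L3 / L3': x or x -> x, x and x -> x
    if op in ("and", "or") and args[0] == args[1]:
        return args[0]
    return (op, *args)


def equivalent(s, t):
    """Two terms are identified by this toy system iff their normal forms coincide."""
    return normalize(s) == normalize(t)


print(normalize(("not", ("not", ("and", "x", 1)))))
print(equivalent(("or", ("and", "x", "x"), 0), "x"))
print(equivalent(("and", "x", "y"), ("and", "y", "x")))
```

The last call shows why commutativity needs separate treatment: the two terms are equal in the theory, but no terminating left-to-right rule can relate them, which motivates comparing terms as trees with unordered children.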


3 Directed Acyclic Graph Equivalence 


The structure of formulas with commutative nodes corresponds to the usual mathematical
definition of a labelled rooted tree, i.e. an acyclic graph with one distinguished vertex 
(root) where there is no order on the children of a node. For this reason, we use as our 
starting point the algorithm of Hopcroft, Ullman and Aho for tree isomorphism [14, Page 
84, Example 3.2], which has also been studied subsequently [7, 23]. 

To account for structure sharing, we further generalize this representation to singly- 
rooted, labeled, Directed Acyclic Graphs, which we simply call DAGs. Our DAGs gener- 
alize rooted directed trees. Any DAG can be transformed into a rooted tree by duplicating 
subgraphs corresponding to nodes with multiple parents, as in Figure 2. This transformation
in general results in an exponential blowup in the number of nodes. Dually, using
DAGs instead of trees can exponentially shrink space needed to represent certain terms. 
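The exponential gap can be seen on a small family of DAGs. The Python sketch below is an illustration under assumed names (`Node`, `shared_chain`, `dag_size`, `tree_size` are hypothetical): it builds a chain in which each node uses the node below it as both children, then compares the number of distinct nodes with the size of the unfolded tree.

```python
class Node:
    """A labelled DAG node; children may be shared between parents."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = tuple(children)


def shared_chain(depth):
    """A DAG of depth+1 nodes where each inner node has the same child twice."""
    node = Node("x")
    for _ in range(depth):
        node = Node("and", (node, node))
    return node


def dag_size(root):
    """Number of distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        n = stack.pop()
        if id(n) not in seen:
            seen.add(id(n))
            stack.extend(n.children)
    return len(seen)


def tree_size(root):
    """Number of nodes of the tree obtained by duplicating shared subgraphs."""
    memo = {}

    def go(n):
        if id(n) not in memo:
            memo[id(n)] = 1 + sum(go(c) for c in n.children)
        return memo[id(n)]

    return go(root)


root = shared_chain(10)
print(dag_size(root), tree_size(root))
```

For depth d the DAG has d + 1 nodes while the unfolded tree has 2^(d+1) - 1, so any algorithm that first expands the DAG into a tree cannot stay log-linear in the DAG size.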


Fig. 2. A DAG and the corresponding tree.

Fig. 3. Two equivalent DAGs with different numbers of nodes.


Checking for equality between ordered trees or DAGs is easy in linear time: we 
simply recursively check equality between the children of two nodes. 


Definition 5. Two ordered nodes $\tau$ and $\pi$ with children $\tau_0, ..., \tau_m$ and $\pi_0, ..., \pi_n$ are
equivalent (noted $\tau \sim \pi$) iff


label($\tau$) = label($\pi$), $m = n$ and $\forall i \le n,\ \tau_i \sim \pi_i$

For unordered trees or DAGs, the equivalence checking is less trivial, as the naive al-
gorithm has exponential complexity due to the need to find the adequate permutation.


Definition 6. Two unordered nodes $\tau$ and $\pi$ with children $\tau_0, ..., \tau_m$ and $\pi_0, ..., \pi_n$ are
equivalent (noted $\tau \sim \pi$) iff


label($\tau$) = label($\pi$), $m = n$ and there exists a permutation $p$ s.t. $\forall i \le n,\ \tau_{p(i)} \sim \pi_i$
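Definition 6 can be transcribed almost literally into code, which also makes the cost of the naive approach visible. The Python sketch below assumes nodes are pairs `(label, children)` with `children` a tuple; this representation and the function name are illustrative only. It tries every permutation of the children, so its running time grows factorially with the number of children and exponentially with depth.

```python
from itertools import permutations


def equivalent_naive(tau, pi):
    """Direct reading of Definition 6: search for a matching permutation."""
    label_t, kids_t = tau
    label_p, kids_p = pi
    if label_t != label_p or len(kids_t) != len(kids_p):
        return False
    # For leaves, permutations(()) yields one empty arrangement, which matches.
    return any(
        all(equivalent_naive(a, b) for a, b in zip(perm, kids_p))
        for perm in permutations(kids_t)
    )


x, y, z = ("x", ()), ("y", ()), ("z", ())
left = ("and", (("or", (x, y)), z))
right = ("and", (z, ("or", (y, x))))
print(equivalent_naive(left, right))
print(equivalent_naive(("and", (x, x)), ("and", (x, y))))
```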


For trees, note that this definition of equivalence corresponds exactly to isomor- 
phism. It is known that DAG-isomorphism is GI-complete, so it is conjectured to have 
complexity greater than PTIME. Fortunately, this does not prevent our solution because 
our notion of equivalence on DAGs is not the same as isomorphism on DAGs. In partic- 
ular, two DAGs can be equivalent without having the same number of nodes, i.e. without 
being isomorphic, as Figure 3 illustrates. 


Algorithm 1: Unordered DAG equivalence. The operator ++ is concatenation.
input : two unordered DAGs τ and π
output: True if τ and π are equivalent, False otherwise.
 1 codes ← HashMap[(String, List[Int]), Int];
 2 map ← HashMap[Node, Int];
 3 s_τ : List ← ReverseTopologicalOrder(τ);
 4 s_π : List ← ReverseTopologicalOrder(π);
 5 for (n : Node in s_τ ++ s_π) do
 6     l_n ← [map(c) for c in children(n)];
 7     r_n ← (label(n), sort(l_n));
 8     if codes contains r_n then
 9         map(n) ← codes(r_n);
10     else
11         codes(r_n) ← codes.size;
12         map(n) ← codes(r_n);
13     end
14 end
15 return map(τ) == map(π)


Algorithm 1 is the generalization of Hopcroft, Ullman and Aho's algorithm. It decides
in log-linear time if two labelled (unordered) DAGs are equivalent according to
Definition 6. The algorithm generalizes straightforwardly to DAGs with a mix of ordered
and unordered nodes: if a node is ordered, we skip the sorting operation in line 7.

The algorithm works bottom to top. We first sort the DAG in reverse topological 
order using, for example, Kahn’s algorithm [16]. This way, we explore the DAG starting 
from a leaf and finishing with the root. It is guaranteed that when we treat a node, all its 
children have already been treated. 

The algorithm recursively assigns codes to the nodes of both DAGs. In
the unlabelled case:

• The first node, necessarily a leaf, is assigned the integer 0.

• The second node gets assigned 0 if it is a leaf or 1 if it is a parent of the first node.

• For any node, the algorithm makes a list of the integers assigned to that node's chil-
dren and sorts it (if the node is commutative). We call this the signature of the node.
Then it checks if that list has already been seen. If yes, it assigns to the node the
number that has been given to other nodes with the same signature. Otherwise, it
assigns a new integer to that node and its signature.
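The steps above can be sketched in Python as follows. This is an illustrative transcription of Algorithm 1 under stated assumptions, not the authors' Scala implementation: the `Node` class and function names are hypothetical, nodes are keyed by `id()` in place of the address-based hash mentioned later, and the reverse topological order is obtained with an iterative post-order traversal instead of Kahn's algorithm.

```python
class Node:
    """A labelled node whose children are unordered and may be shared."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def reverse_topological_order(root):
    """List the nodes reachable from root so that children precede parents."""
    order, seen = [], set()
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)            # all children were emitted already
        elif id(node) not in seen:
            seen.add(id(node))
            stack.append((node, True))    # revisit after the children
            stack.extend((c, False) for c in node.children)
    return order


def dag_equivalent(tau, pi):
    """Assign one integer code per signature and compare the codes of the roots."""
    codes = {}      # signature (label, sorted child codes) -> code
    code_of = {}    # id(node) -> code, the `map` of Algorithm 1
    for n in reverse_topological_order(tau) + reverse_topological_order(pi):
        signature = (n.label, tuple(sorted(code_of[id(c)] for c in n.children)))
        if signature not in codes:
            codes[signature] = len(codes)  # fresh code for an unseen signature
        code_of[id(n)] = codes[signature]
    return code_of[id(tau)] == code_of[id(pi)]


x, y = Node("x"), Node("y")
shared = Node("and", [x, y])
dag = Node("or", [shared, shared])                      # 4 distinct nodes
tree = Node("or", [Node("and", [Node("y"), Node("x")]),
                   Node("and", [Node("x"), Node("y")])])  # 7 distinct nodes
print(dag_equivalent(dag, tree))
print(dag_equivalent(Node("and", [x, x]), Node("and", [x, y])))
```

The first call compares a DAG with sharing against an unshared tree with permuted children, mirroring the situation of Figure 3 where equivalent DAGs have different numbers of nodes.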


Lemma 1 (Algorithm 1 Correctness). The codes assigned to any two nodes n and m 
of s,++s, are equal if and only if n ~ m. 


Proof. Let n and m denote any two DAG nodes. By induction on the height of n:

- In the case where n is a leaf, we have $r_n$ = (label(n), Nil). Note that for any node
  n, map(n) = codes($r_n$). Since every time the map codes is updated, it is with a
  completely new number, codes($r_n$) = codes($r_m$) if and only if $r_n = r_m$, i.e. iff
  label(m) = label(n) and m has no children (like n).

- In the case where n has children $n_i$, again codes($r_n$) = codes($r_m$) if and only if
  $r_m = r_n$, which is equivalent to label(m) = label(n) and sort($l_m$) = sort($l_n$).
  This means there is a permutation p of the children of n such that
  $\forall i$, map($n_{p(i)}$) = map($m_i$). By induction hypothesis, this is equivalent
  to $\forall i,\ n_{p(i)} \sim m_i$. Hence we find that map(n) = map(m) if and only if both:

  1. Their labels are equal
  2. There exists a permutation p s.t. $n_{p(i)} \sim m_i$ for all i

  i.e. n and m have the same code if and only if n ~ m.


Corollary 1. The algorithm returns True if and only if $\tau \sim \pi$.


Time Complexity. Using Kahn's algorithm, sorting $\tau$ and $\pi$ is done in linear time. Then
the loop touches every node a single time. Inside the loop, the first line takes linear time 
with respect to the number of children of the node and the second line takes log-linear 
time with respect to the number of children. Since we use HashMaps, the last instructions 
take effectively constant time (because hash code is computed from the address of the 
node and not its content). 

So for a general DAG, the algorithm runs in time at most log-quadratic in the number
of nodes. Note however that for DAGs with a bounded number of children per node, as well
as for DAGs with a bounded number of parents per node, the algorithm is log-linear. In
fact, the algorithm is log-linear with respect to the total number of edges in the graph.
For this reason, the algorithm is still only log-linear in the input size. It also follows
that the algorithm is always at most log-linear with respect to the tree or formula
underlying the DAG, which may be much larger than the DAG itself. Moreover, there exist
cases where the algorithm is log-linear in the number of nodes, but the underlying tree is
exponentially larger. The full binary symmetric graph is such an example.
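To make the gap between a DAG and its underlying tree concrete, the following sketch builds a chain-shaped DAG in which every node uses the next node twice as a child, and counts the nodes of the unfolded tree. Reading "full binary symmetric graph" as this construction is an assumption, and the helper `tree_size` is a hypothetical name introduced only for this illustration.

```python
def tree_size(children, root):
    """Number of nodes of the tree obtained by unfolding the DAG below root."""
    size = {}

    def visit(n):
        # Memoize so that each DAG node is computed once, although it
        # occurs exponentially many times in the unfolded tree.
        if n not in size:
            size[n] = 1 + sum(visit(c) for c in children[n])
        return size[n]

    return visit(root)


depth = 20
# Node i has node i + 1 as both its left and its right child.
children = {i: [i + 1, i + 1] for i in range(depth)}
children[depth] = []
dag_nodes = len(children)               # 21 nodes, 40 edges
unfolded_nodes = tree_size(children, 0)  # 2**21 - 1 nodes
```

The DAG has 21 nodes and 40 edges, while the tree it represents has 2^21 - 1 nodes.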
4 Word Problem on Orthocomplemented Bisemilattices


We will use the previous algorithm for DAG equivalence applied to a formula in the
language of bisemilattices $(S, \land, \lor)$ to account for commutativity (axioms L1, L1'), but
we need to combine it with the remaining axioms. From now on we work with axioms
L1-L8, L1'-L8' in Table 1. The plan is to express those axioms as reduction rules. Of
rules L2-L8 and L2'-L8', all but L8 and L8' reduce the size of the term when applied
from left to right, and hence seem suitable as rewrite rules.

It may seem that the simplest way to deal with the de Morgan laws is to use them (along
with double negation elimination) to transform all terms into negation normal form.
However, doing so causes trouble when trying to detect application cases of rule L7
(complementation). Indeed, consider the following term:

    $t = (a \land b) \lor \neg(a \land b)$

Using complementation it clearly reduces to 1, but pushing it into negation normal form
would first transform it into $(a \land b) \lor (\neg a \lor \neg b)$. Detecting that these two
disjuncts are actually opposite requires recursively verifying that
$\neg(a \land b) = (\neg a \lor \neg b)$.

It is actually simpler to apply the de Morgan law in the following way:

    $x \land y = \neg(\neg x \lor \neg y)$

Instead of removing negations from the formula, we remove one of the binary semilattice
operators. (Which one we keep is arbitrary; we chose to keep $\lor$.) Now, when we check
whether rule L7 can be applied to a disjunction node (i.e. whether it has two children y
and z such that $y = \neg z$), there are two cases for each child x: if x is not itself a
negation, i.e. it starts with a disjunction, we compute the code of $\neg x$ from the code
of x in constant time. If $x = \neg x'$ then $\neg x \sim x'$, so the code of $\neg x$ is simply
the code of x', in constant time as well. Hence we obtain the codes of all children and
of their negations, and we can sort those codes to look for collisions, all of it in time
linear in the number of children.
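As a small illustration of this transformation, the sketch below removes every conjunction from a term using the identity $x \land y = \neg(\neg x \lor \neg y)$. The tuple encoding of terms (`("var", name)`, `("not", t)`, `("and", ...)`, `("or", ...)`) and the function name `eliminate_and` are assumptions made for the example, not notation from the paper.

```python
def eliminate_and(t):
    """Rewrite a term so that it only uses negation and disjunction."""
    kind = t[0]
    if kind in ("var", "const"):
        return t
    if kind == "not":
        return ("not", eliminate_and(t[1]))
    args = tuple(eliminate_and(a) for a in t[1:])
    if kind == "or":
        return ("or",) + args
    # kind == "and": x AND y becomes NOT(NOT x OR NOT y)
    return ("not", ("or",) + tuple(("not", a) for a in args))


a, b = ("var", "a"), ("var", "b")
term = ("or", ("and", a, b), ("not", ("and", a, b)))
left, right = eliminate_and(term)[1:]
```

On the example term from the text, the two disjuncts become `u` and `("not", u)` for the same subterm `u`, so the complementation case is syntactically visible without any recursive comparison.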
We now restate the axioms L1-L8, L1'-L8' in this updated language in Table 3.


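A1 : $\bigsqcup(\ldots, x_i, x_j, \ldots) = \bigsqcup(\ldots, x_j, x_i, \ldots)$      A1' : $\neg\bigsqcup(\neg x, \neg y) = \neg\bigsqcup(\neg y, \neg x)$

A2 : $\bigsqcup(\vec{x}, \bigsqcup(\vec{y})) = \bigsqcup(\vec{x}, \vec{y})$                        A2' : $\neg\bigsqcup(\neg\vec{x}, \neg\neg\bigsqcup(\neg\vec{y})) = \neg\bigsqcup(\neg\vec{x}, \neg\vec{y})$
     $\bigsqcup(x) = x$

A3 : $\bigsqcup(x, x, \vec{y}) = \bigsqcup(x, \vec{y})$                                  A3' : $\neg\bigsqcup(\neg x, \neg x, \neg\vec{y}) = \neg\bigsqcup(\neg x, \neg\vec{y})$

A4 : $\bigsqcup(1, \vec{x}) = 1$                                                    A4' : $\neg\bigsqcup(\neg 0, \neg\vec{x}) = 0$

A5 : $\bigsqcup(0, \vec{x}) = \bigsqcup(\vec{x})$                                        A5' : $\neg\bigsqcup(\neg 1, \neg\vec{x}) = \neg\bigsqcup(\neg\vec{x})$

A6 : $\neg\neg x = x$

A7 : $\bigsqcup(x, \neg x, \vec{y}) = 1$                                            A7' : $\neg\bigsqcup(\neg x, \neg\neg x, \neg\vec{y}) = 0$

A8 : $\neg\bigsqcup(x_1, \ldots, x_i) = \neg\bigsqcup(\neg\neg x_1, \ldots, \neg\neg x_i)$      A8' : $\neg\neg\bigsqcup(\neg x_1, \ldots, \neg x_i) = \bigsqcup(\neg x_1, \ldots, \neg x_i)$


Table 3. Laws of algebraic structures $(S, \bigsqcup, 0, 1, \neg)$, equivalent to L1-L8, L1'-L8' under
de Morgan transformation.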


It is straightforward and not surprising that axiom A8 as well as A1'-A8' all follow 
from axioms A1-A7, so A1-A7 are actually complete for our theory. 
4.1 Confluence of the Rewriting System


In our equivalence algorithm, A1 is taken care of by the arbitrary but consistent ordering 
of the nodes. Axioms A2-A7 form a term rewriting system. Since all those rules reduce 
the size of the term, the system is terminating in a number of steps linear in the size of 
the term. We will next show that it is confluent. We will thus obtain the existence of 
a normal form for every term, and will finally show how our algorithm computes that 
normal form. 


Definition 7. Consider a pair of reduction rules $l_0 \to r_0$ and $l_1 \to r_1$ with disjoint sets
of free variables such that $l_0 = D[s]$, s is not a variable and $\sigma$ is the most general
unifier of $\sigma s = \sigma l_1$. Then $(\sigma r_0, (\sigma D)[\sigma r_1])$ is called a critical pair.


Informally, a critical pair is a most general pair of terms (with respect to unification)
$(t_1, t_2)$ such that for some $t_0$, $t_0 \to t_1$ and $t_0 \to t_2$ via two "overlapping" rules.
They are found by matching the left-hand side of a rule with a non-variable subterm of
the same or another rule.


Example 1 (Critical Pairs).


1. Matching the left-hand side of A6 with the subterm $\neg x$ of rule A7, we obtain the pair

       $(\bigsqcup(\neg x, x, \vec{y}),\ 1)$

   which arises from reducing the term $t_0 = \bigsqcup(\neg x, \neg\neg x, \vec{y})$ in two different ways.
2. Matching the left-hand sides of A2 and A7 gives

       $(\bigsqcup(\vec{x}, \vec{y}, \neg\bigsqcup(\vec{y})),\ 1)$

   which arises from reducing $\bigsqcup(\vec{x}, \bigsqcup(\vec{y}), \neg\bigsqcup(\vec{y}))$ using A2 or A7.
3. Matching the left-hand sides of A5 and A7 gives

       $(\neg 0,\ 1)$

   which arises from reducing $0 \sqcup \neg 0$ in two different ways.


Proposition 1 ([1, Chapter 6]). A terminating term rewriting system is confluent if and
only if all critical pairs $(t_1, t_2)$ are joinable, i.e. $\exists t_3.\ t_1 \to^* t_3 \land t_2 \to^* t_3$.


In the first of the previous examples, the pair is clearly joinable by commutativity and
a single application of rule A7 itself. The second example is more interesting. Observe
that $\bigsqcup(\vec{x}, \vec{y}, \neg\bigsqcup(\vec{y})) = 1$ is a consequence of our axioms, but the left part
cannot be reduced to 1 in general in our system. To solve this problem we need to add the
rule A9: $\bigsqcup(\vec{x}, \vec{y}, \neg\bigsqcup(\vec{y})) = 1$. Similarly, the third example forces us to add
A10: $\neg 0 = 1$ to our set of rules. From A10 and A6 we then find the expected critical
pair A11: $\neg 1 = 0$.
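A1  : $\bigsqcup(\ldots, x_i, x_j, \ldots) = \bigsqcup(\ldots, x_j, x_i, \ldots)$
A2  : $\bigsqcup(\vec{x}, \bigsqcup(\vec{y})) = \bigsqcup(\vec{x}, \vec{y})$
      $\bigsqcup(x) = x$
A3  : $\bigsqcup(x, x, \vec{y}) = \bigsqcup(x, \vec{y})$
A4  : $\bigsqcup(1, \vec{x}) = 1$
A5  : $\bigsqcup(0, \vec{x}) = \bigsqcup(\vec{x})$
A6  : $\neg\neg x = x$
A7  : $\bigsqcup(x, \neg x, \vec{y}) = 1$
A9  : $\bigsqcup(\vec{x}, \vec{y}, \neg\bigsqcup(\vec{y})) = 1$
A10 : $\neg 0 = 1$
A11 : $\neg 1 = 0$


Table 4. Terminating and confluent set of rewrite rules equivalent to L1-L8, L1'-L8'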


4.2 Complete Terminating Confluent Rewrite System 


The analysis of all possible pairs of rules to find all critical pairs is straightforward. It
turns out that A9, A10 and A11 are the only rules we need to add to our system to
obtain confluence. We have checked the complete list of critical pairs for rules A2-A11
(we omit the details due to lack of space). All those pairs are joinable, i.e. reduce to the 
same term, which implies, by Proposition 1, that the system is confluent. Table 4 shows 
the complete set of reduction rules (as well as commutativity). 

Since the system A2-A11 considered over the language $(S, \bigsqcup, \neg, 0, 1)$ modulo
commutativity of $\bigsqcup$ is terminating and confluent, it implies the existence of a normal
form reduction. For any term t, we denote its normal form by $t{\downarrow}$. In particular, for any
two terms $t_1$ and $t_2$, we have $t_1 = t_2$ in our theory iff $t_1 \leftrightarrow^* t_2$ iff $t_1{\downarrow}$ and
$t_2{\downarrow}$ are equivalent terms modulo commutativity. We finally reach our conclusion: an
algorithm that computes the normal form (modulo commutativity) of any term gives a
decision procedure for the word problem for orthocomplemented bisemilattices.


5 Algorithm and Complexity 


The rewriting system readily gives us a quadratic algorithm. Indeed, using our base 
algorithm for DAG equivalence, we can check, in linear time, for application cases of 
any one of rewriting rules A2-A11 of Table 4 modulo commutativity. Since a term can 
only be reduced up to n times, the total time spent before finding the normal form of a 
term is at most quadratic. It is however possible to find the normal form of a term in a 
single pass of our equivalence algorithm, resulting in a more efficient algorithm. 


5.1 Combining Rewrite Rules and Tree Isomorphism


We give an overview of how to combine rules A2-A7, A9, A10, A11 within the tree
isomorphism algorithm, which we present using Scala-like¹ pseudocode in Figure 7.


¹ https://www.scala-lang.org/
For conciseness, we omit the dynamic programming optimizations allowed by structure
sharing in DAGs (which would store the normal form and additionally check if a
node was already processed). For each rule, we indicate the most relevant lines of the
algorithm in Figure 7.


A2 (Associativity, Lines 10, 20, 32, 42) When analysing a $\bigsqcup$ node, after the recursive
call, find all children that are $\bigsqcup$ themselves and replace them by their own children.
This is simple enough to implement, but there is a caveat in terms of complexity. We
will come back to it in Section 5.2.


A3 (Idempotence, Lines 8, 31, 35) This corresponds to the fact that we eliminate
duplicate children in disjunctions. When reaching a $\bigsqcup$ node, after having sorted the
codes of its children, remove all duplicates before computing its own code.


A4, A5 (Bounds, Lines 8, 31, 35, 11, 36) To account for those axioms, we reserve a
special code for the nodes 1 and 0. For A4, when we reach some $\bigsqcup$ node, if it has 1 as
one of its children, we accordingly replace the whole node by 1. For A5, we just remove
nodes with the same code as 0 from the parent node before computing its own code.


A6 (Involution, Lines 17, 22) When reaching a negation node, if its child is itself a
negation node, replace the parent node by its grandchild before assigning it a code.


A7 (Complement, Lines 11, 36) As explained earlier, our representation of nodes lets us
do the following to detect cases of A7: first remember that we already applied double
negation elimination, so that two "opposite" nodes cannot both start with a negation.
Then we can simply separate the children between negated and non-negated (after the
recursive call), sort them using their assigned code and look for collisions.


A9 (Also Complement, Lines 11, 36) This rule is slightly more tricky to apply. When
analysing a $\bigsqcup$ node x, after computing the code of all children of x, find all children
of the form $\neg\bigsqcup$. For every such node, take the set of its own children and verify
whether it is a subset of the set of all children of x. If yes, then rule A9 applies. Said
otherwise, we look for collisions between grandchildren (through a negation) and
children of every $\bigsqcup$ node.


A10, A11 (Identities, Lines 17, 26) These rules are simple. In a $\neg$ node, if its child has
the same code as 0 (resp. 1), assign code 1 (resp. 0) to the negated node.
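The rule-by-rule description above can be sketched as a naive bottom-up normalizer in Python. This is a deliberately simplified illustration, not the log-linear algorithm of Figure 7: it works on syntactic terms instead of integer codes, sorts children by their printed representation, and ignores structure sharing and the size-ordered traversal. The tuple encoding and the names `neg`, `disj` and `normal_form` are assumptions made for the example.

```python
ZERO, ONE = ("const", 0), ("const", 1)


def neg(t):
    """Normal form of the negation of t, where t is already normal."""
    if t == ZERO:
        return ONE                 # A10
    if t == ONE:
        return ZERO                # A11
    if t[0] == "not":
        return t[1]                # A6
    return ("not", t)


def disj(children):
    """Normal form of a disjunction whose children are already normal."""
    flat = []
    for c in children:
        flat.extend(c[1:] if c[0] == "or" else [c])          # A2 (flattening)
    # A5 drops 0, the set removes duplicates (A3), sorting fixes an order (A1).
    kids = sorted({c for c in flat if c != ZERO}, key=repr)
    present = set(kids)
    if ONE in present:
        return ONE                                           # A4
    for c in kids:
        if c[0] == "not":
            inner = c[1]
            if inner in present:
                return ONE                                   # A7
            if inner[0] == "or" and set(inner[1:]) <= present:
                return ONE                                   # A9
    if not kids:
        return ZERO
    if len(kids) == 1:
        return kids[0]                                       # A2, second half
    return ("or", *kids)


def normal_form(t):
    """Normalize a term built from ("var", name), constants, "not" and "or"."""
    if t[0] == "not":
        return neg(normal_form(t[1]))
    if t[0] == "or":
        return disj([normal_form(c) for c in t[1:]])
    return t
```

Two terms are then equivalent in the theory exactly when their normal forms are equal, for example `normal_form(("or", a, b)) == normal_form(("or", b, ("or", a, ZERO)))` for variables `a` and `b`.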


5.2 Case of Quadratic Runtime for the Basic Algorithm 


All the rules we introduced in the previous section into Algorithm 1 take time
(log-)linear in the number of children of a node to apply, which is not more than the time
we spend in the DAG/tree isomorphism algorithm. For A3, checking for duplicates is done in
linear time in an ordered data structure. A4 and A5 (Bounds) consist in searching for
specific values, which takes logarithmic time in the size of the list. A6 (Involution)
takes constant time. A7 (Complement) is detected by finding a collision between two
separate ordered lists, also easily done in (log-)linear time. A9 (Also Complement)
consists in verifying whether grandchildren of a node are also children, and since
children are sorted this takes log-linear time in the number of grandchildren. Since a
node is the grandchild of only one other node, the same computation as in the original
algorithm holds. A10 and A11 take constant time. Hence, the total time complexity is
O(n log(n)), as in the algorithm for tree isomorphism.

As stated in Section 3 regarding the algorithm for DAG equivalence whose complexity we
aim to preserve, the time complexity analysis crucially relies on the fact that in a
tree, a node is never the child (or grandchild) of more than one node during the
execution. However, this is generally not true in the presence of associativity. Indeed,
consider the term represented in Figure 4. The 5th $\bigsqcup$ has 2 children, but after applying
A2, the 4th has 3 children, the 3rd has 4 children and so on. On the generalization of
such an example, since an $x_i$ is the child of all higher $\bigsqcup$, our key property does not
hold and the algorithm runtime would be quadratic. Of course, such a simple
counterexample is easily solved by applying a leading pass of associativity reduction
before actually running the whole algorithm. It turns out however that this is not
sufficient, since cases of associativity can appear after the application of the other
A-rules.


Fig. 4. A term with quadratic runtime

In fact, there is only one rule that can create cases of rule A2, and this rule is A6
(Involution). The remaining rules whose right-hand side can start with a $\bigsqcup$ have their
left-hand side already starting with $\bigsqcup$. It may seem simple enough to also apply double
negation elimination in a leading pass, but unfortunately, cases of A6 can also be created
by other rules. It is easy to see, for similar reasons, that only the application of A2b
($\bigsqcup(x) = x$) can create such cases. And unfortunately, such cases of A2b can arise from
rules A3 and A5, which can only be detected using the full algorithm. To summarize,
the typical problematic case is depicted in Figure 5. This term is clearly equivalent to
$\bigsqcup(x_1, x_2, x_3, x_4)$, but to detect it we must first find that $z_1$ and $z_2$ are equivalent
to 0, so we cannot simply solve it with an early pass.


5.3 Final Log-Linear Time Algorithm 


Fortunately, we can solve this problem at a logarithmic-only price. Observe that if we
were able to detect early the nodes which cancel to 0, the problem would not exist:
when analysing a node, we would first call the algorithm on all subnodes equivalent to
0, remove them, and then, when there is a single child left, remove the trivial
disjunction, the double negation and the successive disjunction (as in Figure 5) before
doing the recursive call on the unique nontrivial child. However, we of course cannot
know in advance which child will be equivalent to 0.


Fig. 5. A non-trivial term with quadratic runtime


Fig. 6. The term of Figure 5 during the algorithm's execution

Moreover, note (still using Figure 5) that if the z-child is as large as the non-trivial
node, then even if we do the "useless" work, we at least obtain that the size of the tree
is divided by two, and hence the potential depth of the tree as well. By standard
complexity analysis, the time penalty would only be a logarithmic factor.

The previous analysis suggests the following solution, reflected in Figure 7 lines 
28-29. When analysing a node, make recursive calls on children in order of their size, 
starting with the smallest up to the second biggest. If any of those children are non-zero, 
proceed as normal. If all (but possibly the last) children are equivalent to zero, then 
replace the current node by its biggest (and at this point non-analyzed) child, i.e. apply 
second half of rule A2 (associativity). If applicable, apply double negation elimination 
and associativity as well before continuing the recursive call. 

We illustrate this on the example of Figure 5. Consider the algorithm when reaching
the second $\bigsqcup$ node. There are two cases:


1. Suppose $z_1$ is a smaller tree than the non-trivial child. In this case the algorithm
   will compute a code for $z_1$, find that it is 0 and delete it. Then the non-trivial node
   is a single child, so the whole disjunction is removed. Hence, the double negation
   can be removed and the two consecutive disjunctions of $x_1$ and $x_2$ merged, obtaining
   the term illustrated in Figure 6. In particular we did not compute a code for the two
   deleted $\bigsqcup$ nodes, which is exactly what we wanted for our initial analysis.

2. Suppose $z_1$ is a larger tree than the non-trivial child. In this case, we would first
   recursively compute the code of the non-trivial child and then detect that $z_1 \sim 0$.
   We indeed computed the code of the disjunction that contains $x_2$ when it was
   unnecessary, since we apply associativity anyway. This "useless" work consists in
   sorting and applying axioms to the true children of the node (in this case $x_2$, $x_3$
   and $x_4$) and takes time quasilinear in the number of such children. In particular, it
   is bounded by the size of the subtree itself and we know it is the smaller of the two.


An analogous situation can arise from the use of rule A3 (idempotence), but here
trivially the two subtrees must have the same number of (real) subnodes, so that the same
reasoning holds.

Denote by |n| the size of a node, i.e. the number of descendants of n. We compute
the penalty of useless work we incur by computing children of a node n in the wrong
order, i.e. by computing a non-0 child $n_i$ when all others are 0. $n_i$ cannot be the largest
child of n, for otherwise we would have found that all other children are 0 before needing
to compute $n_i$. Hence $|n_i| < |n|/2$. It follows that the total amount of useless work is
bounded by $\log(|n|) \cdot W(n)$, where


    $W(n) \le |n|/2 + \sum_i W(n_i)$, for $\sum_i |n_i| < |n|$.


It is clear that W(n) is maximized when n has exactly two children of equal size:

    $W(n) \le |n|/2 + 2 \cdot W(n/2)$

By observing that we can divide n by 2 only log(n) times,

    $W(n) \le \sum_{m=1}^{\log(n)} 2^m \cdot |n|/2^m$

so we obtain $W(n) = O(|n| \log(|n|))$ and hence the total runtime is $O(n (\log n)^2)$.
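As a quick numerical sanity check of this bound, the sketch below evaluates the balanced recurrence $W(n) = n/2 + 2 \cdot W(n/2)$ with equality and compares it with $n \log_2(n) / 2$. The base case $W(1) = 0$ and the function name `worst_case_w` are assumptions made for the illustration; the recurrence itself is the one stated above.

```python
from functools import lru_cache
from math import log2


@lru_cache(maxsize=None)
def worst_case_w(size):
    """Balanced case of the recurrence: W(n) = n/2 + 2 * W(n/2), W(1) = 0."""
    if size <= 1:
        return 0.0
    return size / 2 + 2 * worst_case_w(size // 2)


# For powers of two the recurrence solves exactly to n * log2(n) / 2.
ratios = [worst_case_w(2 ** k) / (2 ** k * log2(2 ** k)) for k in (4, 10, 16)]
```

For powers of two the ratio stays at 1/2, which is consistent with $W(n) = O(|n| \log(|n|))$.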


6 Conclusion 


We have described a decision procedure with log-linear time complexity for the word 
problem on orthocomplemented bisemilattices. This algorithm can also be simplified 
to apply to weaker theories. Dually, we believe it can be generalized to decide some 
stronger theories (still weaker than Boolean algebras) efficiently. While the word problem
for orthocomplemented lattices was known to be in PTIME [15] and as such the membership of
orthocomplemented bisemilattices in PTIME may not come as a surprise, this is, to the best
of our knowledge, the first time that this result has been explicitly stated, and the first
time that an algorithm with such low log-linear complexity
was proposed for this or a related problem. The algorithm has not only low complexity 
but, according to our experience, is easy to implement. It can be used as an approxi- 
mation for Boolean algebra equivalence, and we plan to use it as the basis of a kernel 
for a proof assistant. We also envision possible uses of the algorithm in SMT and SAT 
solvers. The algorithm is able to detect many natural and non-trivial cases of equiva- 
lence even on formulas that may be too large for existing solvers to deal with, so it may 
also complement an existing repertoire of subroutines used in more complex reasoning 
tasks. For a minimal working implementation in Scala closely following Figure 7, see
https://github.com/epfl-lara/OCBSL.
1  def equivalentTrees(tau: Term, pi: Term): Boolean =
2    val codesSig: HashMap[(String, List[Int]), Int] = Empty
3    codesSig.update(("zero", Nil), 0); codesSig.update(("one", Nil), 1)
4    val codesNodes: HashMap[Term, Int] = Empty
5    def updateCodes(sig: (String, List[Int]), n: Node): Unit = ... // codesSig, codesNodes
6    def bool2const(b: Boolean): String = if b then "one" else "zero"
7    def rootCode(n: Term): Int =
8      val L = pDisj(n, Nil).map(codesNodes).sorted.filter(_ != 0).distinct
9      if L.isEmpty then updateCodes(("zero", Nil), n)
10     else if L.length == 1 then codesNodes.update(n, L.head)
11     else if L.contains(1) or checkForContradiction(L) then updateCodes(("one", Nil), n)
12     else updateCodes(("or", L), n)
13     codesNodes(n)
14   def pDisj(n: Node, acc: List[Node]): List[Node] = n match
15     case Variable(id) => updateCodes((id.toString, Nil), n); return n :: acc
16     case Literal(b) => updateCodes((bool2const(b), Nil), n); return n :: acc
17     case Negation(child) => pNeg(child, n, acc)
18     case Disjunction(children) => children.foldLeft(acc)(pDisj)
19   def pNeg(n: Node, parent: Node, acc: List[Node]): List[Node] = n match // under negation
20     case Negation(child) => pDisj(child, acc)
21     case Variable(id) => updateCodes((id.toString, Nil), n)
22       updateCodes(("neg", List(codesNodes(n))), parent)
23       List(parent)::acc
24     case Literal(b) => updateCodes((bool2const(b), Nil), n)
25       updateCodes((bool2const(!b), Nil), parent)
26       List(parent)::acc
27     case Disjunction(children) =>
28       val r0 = orderBySize(children)
29       val r1 = r0.tail.foldLeft(Nil)(pDisj)
30       val r2 = r1.map(codesNodes).sorted.filter(_ != 0).distinct
31       if isEmpty(r2) then pNeg(r0.head, parent, acc)
32       else val s1 = pDisj(r0.head, r1)
33         val s2 = s1 zip (s1 map codesNodes)
34         val s3 = s2.sorted.filter(_ != 0).distinct // all wrt. 2nd element
35         if s3.contains(1) or checkForContradiction(s3)
36           then updateCodes(("one", Nil), n); updateCodes(("zero", Nil), parent)
37             List(parent)::acc
38         else if isEmpty(s3) then updateCodes(("zero", Nil), n)
39           updateCodes(("one", Nil), parent)
40           List(parent)::acc
41         else if s3.length == 1 then pNeg(s3.head._1, parent, acc)
42         else updateCodes(("or", s3 map (_._2)), n)
43           updateCodes(("neg", List(n)), parent)
44           List(parent)::acc
45   return rootCode(tau) == rootCode(pi)


Fig. 7. Final algorithm. distinctBy runs in log-linear time. checkForContradiction detects appli- 
cation cases of A7 and A9 (Complement). Maintenance of size field used by orderBySize elided. 
Abstract. Regression testing is an important activity to check software
changes by running the tests in a test suite to inform the developers 
whether the changes lead to test failures. Regression test prioritization 
(RTP) aims to inform the developers faster by ordering the test suite 
so that tests likely to fail are run earlier. Many RTP techniques have 
been proposed and are often compared with the random RTP baseline 
by sampling some of the n! different test-suite orders for a test suite 
with n tests. However, there is no theoretical analysis of random RTP. 
We present such an analysis, deriving probability mass functions and ex- 
pected values for metrics and scenarios commonly used in RTP research. 
Using our analysis, we revisit some of the most highly cited RTP papers 
and find that some presented results may be due to insufficient sampling. 
Future RTP research can leverage our analysis and need not use random 
sampling but can use our simple formulas or algorithms to more precisely 
compare with random RTP. 


Keywords: Regression Test Prioritization - Random - Analysis 


1 Introduction 


Software developers commonly check their code by running tests. Regression 
testing [48] runs tests after code changes, to check whether the changes break 
the existing functionality. A test that passes before the changes but fails after 
indicates that the changes should be debugged (unless the test is flaky [25]). 
Finding test failures faster enables the developers to start debugging earlier. 

A popular regression testing approach is regression test prioritization (RTP) [12,
19, 21, 23, 38, 39, 48], which runs the tests from a test suite in an order that aims
to find test failures sooner. For example, Google [14] and Microsoft [42] report on
using RTP in industry. More formally, a test suite T is a set (unordered) of tests,
and RTP techniques produce a test-suite order (a permutation of the tests in
the test suite) in which to run the tests. Various RTP techniques have been
proposed in the literature since the seminal papers from 20+ years ago [12, 36, 38, 47]
that have garnered thousands of citations.


© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 217-235, 2022.
https://doi.org/10.1007/978-3-030-99527-0_12
RTP techniques are often compared with random RTP. Our inspection [44]
of the 100 most cited papers on RTP shows that 56 papers use random RTP 
as a comparison baseline. Although random RTP often performs worse than ad- 
vanced techniques, recent papers still use random RTP, because it has a small 
overhead and may perform well in certain scenarios. We additionally check pa- 
pers published in the latest testing conferences (ICST and ISSTA 2020/2021) 
and find that 50% (2/4) of the RTP papers [6,15,30,34] use random RTP. While 
random RTP has been used as a baseline for 20+ years, all evaluations have 
been empirical, performed by randomly sampling some of the n! orders for a test 
suite with n tests. The selected sample size varies (20, 50, 100, 200, 1000), with 
no clear correlation with n; some papers do not even report the sample size [44]. 
However, no prior work has presented a theoretical analysis of random RTP. 

Before we summarize our analysis, we describe some metrics and scenarios 
most commonly used in RTP research. We first introduce some terms: failure 
is simply a failing test, fault is the root cause (bug in the code) for the failure, 
and we say that a failure detects a fault if the failure is caused by the fault [36]. 
In general, many failures may detect the same fault, and one failure may detect 
many faults. We capture the relationship between failures and faults by a
failure-to-fault matrix. To compare RTP techniques, researchers quantify how fast
(test-suite) orders find all faults (not failures because having many failures that detect
the same fault is not as valuable as having a few failures that detect many faults). 

RTP evaluations involve three aspects: RTP metric, failure-to-fault matrix,
and allowed orders. The most widely used metric is Average Percentage of Faults
Detected (APFD) [38], denoted as $\alpha$ for short. Another popular metric is
Cost-Cognizant APFD ($\mathrm{APFD}_c$) [11], denoted as $\gamma$ for short. Section 2 formally defines
these metrics based on the failure-to-fault matrix; each metric assigns to an order
a value between 0 and 1, with higher values indicating better orders. Traditional
RTP research used seeded faults, which allow fairly precisely deriving the
failure-to-fault matrix [10, 22, 37] that can arbitrarily map failures and faults. Recent
RTP research mostly uses real failures, e.g., analyzing real regression testing
runs from continuous integration systems [14, 15, 23, 24, 27, 34], making it rather
difficult to precisely derive the failure-to-fault matrix. As a result, the
increasingly popular failure-to-fault matrices are all-to-one, where all failures map to
the same one fault, and one-to-one, where each failure maps to a distinct fault.

To describe allowed orders, we note that real test suites often partition tests,
e.g., in JUnit [20], each test method belongs to a test class. Traditional research
ignores this partitioning and allows all n! orders ($\Omega_a(T)$ for short) of n tests.
We introduced compatible orders¹ [46] ($\Omega_c(T)$ for short) that consider the
partitioning and allow only orders that do not interleave tests from different classes.

We present the first theoretical analysis for the cases most commonly used
in RTP research. We introduce an algorithm for efficiently computing the exact
probability mass functions (PMFs) of $\alpha$ for all failure-to-fault matrices and
$\Omega_a(T)$. We demonstrate the efficiency of our algorithm on the benchmarks from
the largest RTP dataset for Java projects [34]. For the common all-to-one and
one-to-one cases, we further derive a closed-form formula and a good approximation,
respectively. We also derive closed-form formulas for the expected values
for both $\alpha$ and $\gamma$ for the general failure-to-fault matrix, for both $\Omega_a(T)$ and
$\Omega_c(T)$, and we compare these values in various scenarios. Interestingly, on average,
$\Omega_a(T)$ can perform much better (up to 1/2) than $\Omega_c(T)$ for certain scenarios,
but cannot perform much worse (only up to 1/6) for any scenario; Section 5.1
presents this comparison, including two scenarios near the limits (1/2 and 1/6).


¹ Our original term was class-compatible [46] because we considered as tests only test
methods in test classes, but the concept easily generalizes to other kinds of tests.


[Figure 1 panels: APFD and APFD_c curves (% of faults detected against % of all tests and
% of total cost) for Order o1: C2.t2, C1.t1, C1.t3, C1.t2, C2.t1 (α = 0.7) and
Com. Order o2: C1.t3, C1.t2, C1.t1, C2.t2, C2.t1 (α = 0.57).]

Fig. 1: Example metrics for two orders (Com. is compatible) for n = 5, m = 3;
class C1 has 3 tests with costs (40, 20, 60), class C2 has 2 with (100, 80); C1.t1
detects fault F1; C1.t3 detects F2; C2.t1 detects F2 and F3; C2.t2 detects F3.

We finally derive two interesting properties for the $\alpha$ and $\gamma$ metrics. Using
these properties, we revisit some of the highly cited papers on RTP and find that 
some presented results may be biased due to insufficient sampling. Overall, our 
theoretical analysis provides new insights into the random RTP widely used in 
prior work but only via empirical sampling. Our results show that in many cases 
researchers need not run sampling but can use simple formulas or algorithms to 
obtain more precise statistics for the random RTP metrics. 


2 Preliminaries 


Our notation largely follows the prior work that introduced APFD ($\alpha$) [38] and
$\mathrm{APFD}_c$ ($\gamma$) [11], but we make explicit the failure-to-fault matrix. Let n be the
number of tests and m be the number of faults detected by (some of) these
tests. Let M be a failure-to-fault matrix, i.e., an n × m Boolean matrix such that
$M_{j,i}$ = true iff (failure of) test j detects fault i, and each fault has at least one
failure (i.e., $\forall i. \exists j. M_{j,i}$). Let T be the set of tests in the test suite. We denote
the set of tests that detect the fault i as $T_i = \{j \mid M_{j,i}\}$. In general, $T_i$ and $T_{i'}$
for $i \ne i'$ need not be disjoint because one failing test can detect multiple faults.
The total number of failures is $k = |\{j \mid \exists i. M_{j,i}\}|$, and we use $k_i = |T_i|$.

For an order o (a permutation of T), we use $<_o$ to compare the positions of
two tests t and t' in the order: $t <_o t'$ denotes that t precedes t' in o, and $t \le_o t'$
denotes that t = t' or $t <_o t'$. We denote the j-th test in an order o as $t_j(o)$. Let
$\tau_i(o) = \min\{j \mid M_{t_j(o),i}\}$ be the position of the first test to detect the fault i in o.
Prior work [11, 38] defined metrics $\alpha$ and $\gamma$ (using the notation TF instead of $\tau$).
We use $\alpha(o)$ and $\gamma(o)$ to indicate $\alpha$ and $\gamma$, respectively, for a given order o. We
drop o from $<_o$, $\le_o$, $t_j(o)$, $\tau_i(o)$, $\alpha(o)$, and $\gamma(o)$ when clear from the context.

The most popular RTP metric is $\alpha$ [38], defined for an order o as follows.


Definition 1 ($\alpha$). APFD is defined as 

$$\alpha = 1 - \frac{\sum_{i=1}^{m} \tau_i}{nm} + \frac{1}{2n} \qquad (1)$$

Plotting the percentage of faults detected against the percentage of executed 
tests, $\alpha$ represents the area under the curve, as shown in two examples in Fig. 1. 
The diagonal lines interpolate the percentage of faults detected and lead to nice 
properties of mean/median $\alpha$ values and symmetry (Section 6). $\alpha$ ranges between 
0 and 1, more precisely between $1/(2n)$ and $1 - 1/(2n)$. A larger $\alpha$ indicates that 
an order detects faults earlier, on average. 
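
Definition 1 can be illustrated with a minimal Python sketch. The function name `apfd` and the encoding of the failure-to-fault matrix as a list `detects` of sets $T_i$ are assumptions made only for this example; exact rational arithmetic is used so that the bounds $1/(2n)$ and $1 - 1/(2n)$ can be checked exactly.

```python
from fractions import Fraction

def apfd(order, detects):
    """APFD (alpha) of `order`, following formula (1).

    order   -- sequence of test ids (a permutation of the test suite T)
    detects -- detects[i] is the set T_i of tests that detect fault i
               (hypothetical encoding of the failure-to-fault matrix M)
    """
    n, m = len(order), len(detects)
    position = {t: j for j, t in enumerate(order, start=1)}  # 1-based positions
    # tau_i is the position of the first test in the order that detects fault i
    tau_sum = sum(min(position[t] for t in T_i) for T_i in detects)
    return 1 - Fraction(tau_sum, n * m) + Fraction(1, 2 * n)

# One fault detected by one test among n = 4 tests: best and worst orders.
best = apfd([0, 1, 2, 3], [{0}])   # equals 1 - 1/(2n)
worst = apfd([1, 2, 3, 0], [{0}])  # equals 1/(2n)
```

With a single fault, placing its only failing test first or last yields the two extreme values of $\alpha$ stated above.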

While $\alpha$ effectively considers the number of tests, the “cost cognizant” metric 
$\gamma$ considers the cost of tests [11]. The cost can be measured in various ways, but 
most work uses the test runtime. We use $\sigma(t)$ to denote the cost (runtime) of a 
test $t$; the total cost of a set of tests $T$ is $\sigma(T) = \sum_{t \in T} \sigma(t)$. 


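Definition 2 ($\gamma$). APFD$_c$ is defined as 

$$\gamma = \frac{\sum_{i=1}^{m} \left( \sum_{j=\tau_i}^{n} \sigma(t_j) - \frac{1}{2}\sigma(t_{\tau_i}) \right)}{m \cdot \sigma(T)} \qquad (2)$$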

Plotting the percentage of faults detected against the percentage of total 
test-suite cost, $\gamma$ represents the area under the curve, as shown in Fig. 1. Note 
that $\alpha$ can be viewed as a special case of $\gamma$ where $\forall t, t' \in T. \sigma(t) = \sigma(t')$. 
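
As a hedged illustration of Definition 2, the following Python sketch evaluates formula (2) for one order. The name `apfd_c`, the `detects` list of sets $T_i$, and the `cost` dictionary for $\sigma$ are assumptions of this example only.

```python
from fractions import Fraction

def apfd_c(order, detects, cost):
    """APFD_c (gamma) of `order`, following formula (2).

    order   -- sequence of test ids (a permutation of the test suite T)
    detects -- detects[i] is the set T_i of tests that detect fault i
    cost    -- cost[t] is sigma(t), the cost (e.g., runtime) of test t
    """
    total_cost = sum(cost[t] for t in order)          # sigma(T)
    acc = Fraction(0)
    for T_i in detects:
        first = min(order.index(t) for t in T_i)      # 0-based index of tau_i
        suffix = sum(cost[t] for t in order[first:])  # cost from tau_i to the end
        acc += suffix - Fraction(cost[order[first]]) / 2
    return acc / (len(detects) * total_cost)

# With equal costs, gamma coincides with alpha for the same order.
uniform = {t: 1 for t in range(4)}
gamma_uniform = apfd_c([3, 0, 1, 2], [{0}, {1, 2}], uniform)
```

With equal costs the result coincides with $\alpha$ for the same order, which matches the special-case remark above.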

In practice, tests often belong to classes$^5$ (e.g., JUnit [20] test methods belong 
to test classes, Maven [28] test classes belong to modules, and pytest [35] 
test functions belong to test files), and tests from each class run together. Our 
prior work [46] defined compatible orders as those where all tests from each class 
are consecutive. We use $T_C$ to denote the set of tests in a class $C$. An order $o$ 
is compatible iff $\forall C, j < j' < j''.\ t_j(o) \in T_C \wedge t_{j''}(o) \in T_C \rightarrow t_{j'}(o) \in T_C$. For 
example, $o_2$ in Fig. 1 is compatible, while $o_1$ is not. To distinguish the cases for 
all orders from the cases for only compatible orders, we use the subscripts $a$ and 
$c$, respectively, e.g., $E_a[x]$ and $E_c[x]$ represent the expected value of $x$ for the 
uniform selection of all orders and compatible orders, respectively, and $P_a(A)$ 
and $P_c(A)$ represent the probability of event $A$ for the uniform selection of all 
orders and compatible orders, respectively. We denote the set of all orders and 
all compatible orders for $T$ as $\Omega_a(T)$ and $\Omega_c(T)$, respectively [46]. 

We analyze RTP techniques in scenarios, each of which consists of a test suite 
with $n$ tests, $m$ faults, the failure-to-fault matrix, the cost of each test, and for 
$\Omega_c(T)$ the class of each test. To analyze compatible orders, we introduce some 
new notation to indicate the class of tests. We use $T_{i,C} = T_i \cap T_C$ to denote the 
set of tests in class $C$ that detect the fault $i$. Let $\mathcal{C}$ be the set of all classes, and 
$\mathcal{C}_i$ be the set of classes that contain at least one test that detects the fault $i$, i.e., 
$\mathcal{C}_i = \{C \in \mathcal{C} \mid T_{i,C} \neq \emptyset\}$. Let $C(t)$ be the class that $t$ belongs to, i.e., $t \in T_{C(t)}$. 
The number of compatible orders is $|\Omega_c(T)| = |\mathcal{C}|! \prod_{C \in \mathcal{C}} |T_C|!$. 

$^5$ The term class for a set of tests that run together need not represent a test class. 

For a set of orders $S$, be it $\Omega_a(T)$ or $\Omega_c(T)$, the probability mass function 
(PMF) of a metric, $\alpha$ or $\gamma$, is a function $p$ from the metric value to its probability: 
$p(x) = P(\text{metric} = x) = |\{o \in S \mid \text{metric}(o) = x\}| / |S|$. We next derive some 
PMFs as all prior RTP work shows only sampled distributions of random RTP. 
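
The definition of a PMF over $\Omega_a(T)$ or $\Omega_c(T)$ can be made concrete by brute-force enumeration, which is feasible only for very small test suites. The sketch below is illustrative: the helper names `is_compatible` and `pmf`, and the `cls` dictionary mapping each test to its class, are assumptions of this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

def is_compatible(order, cls):
    """True iff all tests of each class are consecutive in `order`."""
    finished, current = set(), None
    for t in order:
        c = cls[t]
        if c != current:
            if c in finished or c == current:
                return False
            if current is not None:
                finished.add(current)
            current = c
    return True

def pmf(metric, tests, cls=None):
    """Exact PMF of `metric` over all orders (cls=None) or compatible orders."""
    orders = [o for o in permutations(tests)
              if cls is None or is_compatible(o, cls)]
    counts = Counter(metric(o) for o in orders)
    return {x: Fraction(c, len(orders)) for x, c in counts.items()}

# Two classes with two tests each: |C|! * 2! * 2! = 8 compatible orders.
cls = {0: "A", 1: "A", 2: "B", 3: "B"}
n_compatible = sum(is_compatible(o, cls) for o in permutations(range(4)))
# PMF of the position of test 0, used here as a stand-in metric.
position_pmf = pmf(lambda o: o.index(0) + 1, range(4))
```

Any metric, such as $\alpha$ or $\gamma$, can be passed as `metric`; enumeration costs $O(n!)$ evaluations, which motivates the algorithm and formulas derived next.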


3 PMF of $\alpha$ 


To analyze the PMF of the metric $\alpha$, we first propose an algorithm to calculate 
the PMF of $\alpha$ for the general case of $M$. We then discuss two special cases, i.e., 
all-to-one and one-to-one, which are the most common in recent RTP research. 


3.1 Algorithm to Calculate PMF of $\alpha$ for the General Case 


To calculate the PMF of $\alpha$, a naive algorithm would enumerate all $n!$ orders and 
compute $\alpha$ for each order. In theory, $\alpha$ can take $O(n!)$ different values, e.g., when 
$m = \sum_{j=1}^{n} n^j$ and all $n$ tests fail and detect $n, n^2, \ldots, n^n$ different faults, then 
each of the $n!$ orders has a different $\alpha$. In practice, however, the number of faults 
$m$ and the number of failing tests $k$ are usually small, e.g., in our evaluation 
dataset [34], 2906 out of 2980 (98%) scenarios have $k \le 10$. We present an 
algorithm that computes the exact PMF with $O(n^2 mk \cdot k!)$ time complexity. 
Despite the $k!$ factor, the algorithm runs in reasonable time in practice, under 
30 s for any of the 2906 scenarios. When $k > 10$, one can resort to sampling. 

We next describe the intuition for our algorithm. $\sum_{i=1}^{m} \tau_i$ is the only part of 
$\alpha$ that depends on the (test-suite) order, so we first calculate the PMF of this 
sum and then convert it to the PMF of $\alpha$. Iterating over the faults does not 
lead to a nice recursive formulation. Our key insight is to instead iterate over 
the positions of all $k$ failing tests. We view $\sum_{i=1}^{m} \tau_i$ as a weighted sum 

$$\sum_{i=1}^{m} \tau_i = \sum_{j=1}^{k} w_j \phi_j \qquad (3)$$


where $\phi_j$ is the position of the $j$-th failing test in the order, and $w_j \ge 0$ is the 
weight, calculated as the number of faults detected first by the $j$-th failing test 
(Line 11 of Algorithm 1). For example, consider the order $o_1$ in Fig. 1. The 
relative order of the $k = 4$ failing tests is $\rho = \langle C2.t2, C1.t1, C1.t3, C2.t1 \rangle$; we use 
metavariable $\rho$ to distinguish the notation from $o$ for the order of all $n$ tests. For 
this relative order, $w = \langle 1, 1, 1, 0 \rangle$ because the $m = 3$ faults are detected first by 
C2.t2, C1.t1, and C1.t3. The positions for this relative order $\rho$ are $\phi = \langle 1, 2, 3, 5 \rangle$ 
because the 4 failing tests in $\rho$ appear in these positions in the order $o_1$. 

We call a $\phi = \langle \phi_1 \ldots \phi_k \rangle$ valid if $1 \le \phi_1 < \ldots < \phi_k \le n$. Both sequences 
$\phi$ and $w = \langle w_1 \ldots w_k \rangle$ can vary for different orders. While $\phi$ has $\binom{n}{k}$ valid 




Algorithm 1: Calculate the PMF of $\alpha$ 

 1  Input: $n, m, M$  // the number of tests and faults, and the failure-to-fault matrix 
 2  Output: $p$  // the PMF of $\alpha$: $p(x) = P(\alpha = x)$ 
 3  Function PMF()  // main function; return the PMF of $\alpha$ for all orders 
 4      $k = |\{j \mid \exists i. M_{j,i}\}|$  // number of failing tests in $M$, in practice $k \ll n$ 
 5      $q$ = PMF_sum()  // compute the PMF of $\sum_{i=1}^{m} \tau_i$ 
 6      return $\lambda x.\ q(mn - mnx + \frac{m}{2})$  // convert that PMF to the PMF of $\alpha$ 
 7  Function PMF_sum()  // return the PMF of $\sum_{i=1}^{m} \tau_i$ for all orders 
 8      $P = \langle$PMF_rorder($\rho$), $\forall \rho \in$ perms($\{j \mid \exists i. M_{j,i}\}$)$\rangle$  // enumerate all relative orders 
 9      return $\lambda x.\ \sum_{p \in P} p(x) / |P|$  // average PMFs of $\sum_{i=1}^{m} \tau_i$ for each relative order 
10  Function PMF_rorder($\rho$)  // return the PMF of $\sum_{i=1}^{m} \tau_i$ for a relative order $\rho$ 
11      $w = \langle |\{i \mid M_{\rho_j,i} \wedge \nexists j' < j. M_{\rho_{j'},i}\}|, \forall j \in 1..k \rangle$  // $w$ are the weights in formula (3) 
12      return $\lambda s.\ f(w,k,n)(s) / \binom{n}{k}$  // the total number of $\phi$ is $\binom{n}{k}$ 
13  // the function should be memoized to reuse the results for the repeated $w, g, h$ 
    Function f($w, g, h$)  // return $f_{g,h}$ given weights $w$, calculated with formula (4) 
14      if $g > h$ then 
15          return $\lambda s.\ 0$ 
16      if $g = 0$ then 
17          return $\lambda s.\ 1_{s=0}$ 
18      return $\lambda s.\ f(w, g, h-1)(s) + f(w, g-1, h-1)(s - w_g h)$ 


possibilities, we note that $w$ has at most $k!$ possibilities (with $k! \ll \binom{n}{k}$ as 
$k \ll n$ in practice) because $w$ depends only on $\rho$. Therefore, we first fix $w$ by 
enumerating the $k!$ relative orders of the $k$ failing tests. Then for each relative 
order, the problem of calculating the PMF of $\sum_{i=1}^{m} \tau_i = \sum_{j=1}^{k} w_j \phi_j$ becomes 
“given $w$, count the number of valid $\phi$ such that $\sum_{j=1}^{k} w_j \phi_j = s$ for each $s$”, 
which can be solved recursively as follows. 

Let $f_{g,h}(s)$ be the number of assignments for the values of $\phi_1, \ldots, \phi_g$ such 
that $1 \le \phi_1 < \ldots < \phi_g \le h$ and $\sum_{j=1}^{g} w_j \phi_j = s$. The problem is to find 
$f_{k,n}(s)$. As the base case, (1) $f_{g,h}(s) = 0$ for $g > h$ because $\phi_g < g$ cannot hold; 
(2) $f_{0,h}(s) = 1_{s=0}$, where $1$ is the indicator function, because only the empty 
sequence $\langle\rangle$ is valid and $\sum_{j=1}^{0} w_j \phi_j = 0$. For all $h \ge g > 0$, the number of 
assignments for $f_{g,h}(s)$ has two cases: (1) if $\phi_g \le h - 1$, the number is equal to 
$f_{g,h-1}(s)$ by definition; (2) if $\phi_g = h$, the number for $s$ is equal to the number of 
assignments for $\phi_1, \ldots, \phi_{g-1}$ such that $\phi_{g-1} \le \phi_g - 1 = h - 1$ and $\sum_{j=1}^{g-1} w_j \phi_j = 
(\sum_{j=1}^{g} w_j \phi_j) - w_g \phi_g = s - w_g h$, which is $f_{g-1,h-1}(s - w_g h)$. In total, 

$$f_{g,h}(s) = \begin{cases} 0 & g > h \\ 1_{s=0} & g = 0 \\ f_{g,h-1}(s) + f_{g-1,h-1}(s - w_g h) & \text{otherwise} \end{cases} \qquad (4)$$


After solving $f_{k,n}$, we get the PMF of $\sum_{i=1}^{m} \tau_i$ for each relative order of the 
$k$ failing tests. Because each of $k!$ relative orders has the same probability by 
symmetry, we simply take the average of their PMFs to get the PMF of $\sum_{i=1}^{m} \tau_i$ 
for all orders. Finally, we convert the PMF of $\sum_{i=1}^{m} \tau_i$ to the PMF of $\alpha$. 

Table 1: Number of tests, failures, runtime (in ms), and Jensen-Shannon (JS) 
distance for 10 largest scenarios [34] and one synthetic scenario (TSmax) 

Test suite   #Tests (n)   #Failures (k)   Runtime [ms]   Runtime [ms]   JS distance 
                                          all-to-one     one-to-one     (Section 3.2.2) 
TS1          2118         1               513            505            0.0000 
TS2          1986         2               563            629            0.0005 
TS3          2080         3               617            871            0.0003 
TS4          1929         4               680            1147           0.0004 
TS5          1795         5               731            1408           0.0006 
TS6          339          6               627            732            0.0040 
TS7          465          7               678            756            0.0034 
TS8          813          8               829            2009           0.0023 
TS9          52           9               1496           1846           0.0442 
TS10         161          10              10989          27095          0.0150 
TSmax        2118         10              32801          242400         0.0011 


We next describe Algorithm 1 in more detail. The input is the number of tests 
$n$, the number of faults $m$, and the failure-to-fault matrix $M$. The main function 
PMF invokes PMF_sum to get the PMF of $\sum_{i=1}^{m} \tau_i$ and converts it to the PMF of 
$\alpha$. The function PMF_sum enumerates all relative orders $\rho$ of the $k$ failing tests, 
invokes PMF_rorder($\rho$) to get the PMF of $\sum_{i=1}^{m} \tau_i$ for each relative order, and 
averages these PMFs to get the PMF of $\sum_{i=1}^{m} \tau_i$ for all (relative) orders. Function 
PMF_rorder($\rho$) computes the weights $w$ from formula (3), invokes f($w, k, n$) to 
get $f_{k,n}$ for $w$, and converts it to the PMF of $\sum_{i=1}^{m} \tau_i$. 
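
To make the structure of Algorithm 1 concrete, the following Python sketch mirrors its top-down, memoized formulation. It is not the C++ implementation mentioned below; the function name `pmf_alpha`, the `detects` list of sets $T_i$, and the use of `Counter` objects to represent the functions $f_{g,h}$ are assumptions of this example, and the recursion depth limits it to small $n$.

```python
from collections import Counter
from fractions import Fraction
from functools import lru_cache
from itertools import permutations
from math import comb

def pmf_alpha(n, detects):
    """Exact PMF of APFD over all n! orders (tests are 0..n-1, detects[i] = T_i)."""
    m = len(detects)
    failing = sorted(set().union(*detects))
    k = len(failing)

    @lru_cache(maxsize=None)
    def f(w, g, h):
        """Formula (4): maps s to the number of 1 <= phi_1 < ... < phi_g <= h."""
        if g > h:
            return Counter()
        if g == 0:
            return Counter({0: 1})
        out = Counter(f(w, g, h - 1))             # case phi_g <= h - 1 (copy)
        for s, c in f(w, g - 1, h - 1).items():   # case phi_g == h
            out[s + w[g - 1] * h] += c
        return out

    rel_orders = list(permutations(failing))
    sum_pmf = Counter()                            # PMF of sum_i tau_i
    for rho in rel_orders:
        seen, w = set(), []
        for t in rho:  # weight = number of faults first detected by this test
            new = {i for i, T_i in enumerate(detects) if t in T_i} - seen
            w.append(len(new))
            seen |= new
        for s, c in f(tuple(w), k, n).items():
            sum_pmf[s] += Fraction(c, comb(n, k) * len(rel_orders))
    # Convert the PMF of the sum to the PMF of alpha using formula (1).
    return {1 - Fraction(s, n * m) + Fraction(1, 2 * n): p
            for s, p in sum_pmf.items()}

example = pmf_alpha(4, [{0, 1}])   # one fault detected by two of four tests
```

Memoizing on the whole tuple `w` is a simplification; an implementation could key on the prefix of `w` actually used, or use bottom-up dynamic programming as discussed next.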

We finally discuss the time complexity and the empirical performance of 
Algorithm 1. The major cost comes from computing the function $f$. Because 
there are $O(k!)$ different $w$ and $0 \le g \le k$, $g \le h \le n$, we have $O(nk \cdot k!)$ different 
inputs for which to compute $f$. With memoization, $f$ is computed only once for 
each input. Each computation takes $O(nm)$ because $|\text{support}(f_{g,h})| = O(nm)$ 
as $1 \le \tau_i \le n$ for $1 \le i \le m$. Therefore, the cost of computing $f$ for all inputs 
is $O(n^2 mk \cdot k!)$. The other costs in the algorithm are lower than the cost of $f$; 
hence, the overall time complexity of Algorithm 1 is $O(n^2 mk \cdot k!)$. 
Implementation: While top-down recursion makes it easier to present the 
algorithm, for better performance our implementation uses bottom-up dynamic 
programming to compute $f$. Our implementation fits in only 117 lines of C++. 
Dataset: We use the RTP dataset with the most Java projects [34] for our 
evaluation. In this dataset, each test is a test class and each class is a Maven 
module [28]. The dataset has 2980 scenarios, and 2906 (98%) have $k \le 10$. We 
select, for each $k \le 10$, the scenario with the maximum number of tests ($n$) 
from the dataset. We also make a synthetic scenario with 2118 tests, being the 
largest number of tests in the dataset, and 10 failures. We use both all-to-one 
and one-to-one failure-to-fault matrices on the selected scenarios. 

Evaluation: As Table 1 shows, the code finishes in under 30 s (on a common 
laptop) for all real scenarios; it takes more time on the synthetic one for all-to-one 
and one-to-one, but the runtime is still 33 s and 4 min, respectively. 

3.2 PMFs of $\alpha$ for Special Cases 


As mentioned in Section 1, recent RTP research uses real failures and faults, 
with two kinds of failure-to-fault matrices: all-to-one and one-to-one. We discuss 
the PMFs of $\alpha$ for these two commonly used cases. 

3.2.1 All-to-One: We first derive the PMF of $\alpha$ for all-to-one. In this case, 
$m = 1$, $k \ge 1$, and $w_1 = 1$, $\forall j > 1. w_j = 0$ in formula (3). Therefore, the 
recursive formula (4) becomes $f_{g,h}(s) = f_{g,h-1}(s) + f_{g-1,h-1}(s)$ for $g > 1$, which 
is similar to Pascal's triangle. This observation hints that the PMF of $\alpha$ for 
all-to-one may have a closed formula with binomial coefficients. 


Theorem 3 (The PMF of $\alpha$ for all-to-one failure-to-fault matrix). 

$$P\left(\alpha = 1 - \frac{s}{n} + \frac{1}{2n}\right) = \frac{\binom{n-s}{k-1}}{\binom{n}{k}}, \quad s \in \{1, 2, \ldots, n-k+1\} \qquad (5)$$

Proof. For all-to-one, the $\alpha$ value depends solely on $\tau_1$, which is essentially $\phi_1$ in 
formula (3). For $1 \le s \le n-k+1$, $\tau_1 = s$ holds as long as $s = \phi_1 < \ldots < \phi_k \le n$. 
To satisfy the condition, we just need to choose the $k-1$ positions after position $s$. 
Therefore, $\binom{n-s}{k-1}$ out of $\binom{n}{k}$ ways to choose $k$ positions in $n$ satisfy the condition, 
so $P(\tau_1 = s) = \binom{n-s}{k-1} / \binom{n}{k}$, and formula (5) directly follows. 


With (5), we can use $O(n)$ time to compute the PMF of $\alpha$ for all-to-one. We 
can compute the needed binomial coefficients iteratively, starting from $\binom{k-1}{k-1} = 1$, 
with the recurrence $\binom{n'+1}{k-1} = \frac{n'+1}{n'-k+2} \binom{n'}{k-1}$, $n' \ge k-1$, and get $\binom{n}{k} = \frac{n}{k} \binom{n-1}{k-1}$. 
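
The $O(n)$ computation for all-to-one can be sketched in Python as follows. The function name `pmf_alpha_all_to_one` is hypothetical; the loop follows the iterative recurrence for $\binom{n'}{k-1}$ described above rather than calling a library binomial function.

```python
from fractions import Fraction

def pmf_alpha_all_to_one(n, k):
    """PMF of APFD for an all-to-one matrix with n tests and k failing tests."""
    coeffs = {}   # coeffs[n'] = C(n', k-1) for n' = k-1, ..., n-1
    c = 1         # C(k-1, k-1)
    for n_prime in range(k - 1, n):
        coeffs[n_prime] = c
        # C(n'+1, k-1) = C(n', k-1) * (n'+1) / (n'-k+2); the division is exact
        c = c * (n_prime + 1) // (n_prime - k + 2)
    total = coeffs[n - 1] * n // k   # C(n, k) = (n / k) * C(n-1, k-1)
    return {1 - Fraction(s, n) + Fraction(1, 2 * n): Fraction(coeffs[n - s], total)
            for s in range(1, n - k + 2)}

example = pmf_alpha_all_to_one(4, 2)
```

For $n = 4$ and $k = 2$, the first failing test is at position 1, 2, or 3 with probabilities $1/2$, $1/3$, and $1/6$, which gives the three $\alpha$ values in the result.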


3.2.2 One-to-One: We next consider the PMF of $\alpha$ for one-to-one. In this case, 
$m = k$ and each failing test finds a distinct fault, so for every relative order of 
the $k$ failing tests, $\forall j. w_j = 1$ in formula (3). Therefore, running Algorithm 1 and 
memoizing on $w$, the complexity becomes $O(n^2 k^2 + k!)$. The $k!$ term is because we need 
to iterate through all the relative orders. We can avoid $k!$ if we check in advance 
that the failure-to-fault matrix is one-to-one, so the complexity is $O(n^2 k^2)$. 
Moreover, considering formula (4) when $\forall j. w_j = 1$, $f_{k,n}$ essentially models 
the problem “counting the number of partitions of $s$ into $k$ distinct summands 
from $\{1, 2, \ldots, n\}$”. Specifically, $f_{g,h}(s)$ can be viewed as the number of 
partitions of $s$ into $g$ distinct summands in $\{1, 2, \ldots, h\}$, and $f_{g,h}(s) = f_{g,h-1}(s) + 
f_{g-1,h-1}(s-h)$ holds because the $g$-th summand can be either less than $h$ or exactly 
$h$, corresponding to $f_{g,h-1}(s)$ and $f_{g-1,h-1}(s-h)$, respectively. To the best of 
our knowledge, no closed formula is known for this problem. Considering that 
in our evaluation dataset, 99.8% (2975/2980) of scenarios have $n^2 k^2 < 10^9$, the 
$O(n^2 k^2)$ algorithm is efficient enough for practical use for almost all cases. 
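
A bottom-up version of this one-to-one computation can be sketched in Python as follows. The function name `pmf_alpha_one_to_one` and the table layout are assumptions of this example; the table counts $g$-element subsets of positions by their sum, which is formula (4) with all weights equal to 1.

```python
from fractions import Fraction
from math import comb

def pmf_alpha_one_to_one(n, k):
    """PMF of APFD for a one-to-one matrix (m = k) via O(n^2 k^2) dynamic programming."""
    max_sum = sum(range(n - k + 1, n + 1))   # largest possible sum of k positions
    # count[g][s] = number of g-element subsets of the positions processed so far
    # whose elements sum to s
    count = [[0] * (max_sum + 1) for _ in range(k + 1)]
    count[0][0] = 1
    for h in range(1, n + 1):
        # Descending g ensures that position h is used at most once per subset.
        for g in range(min(k, h), 0, -1):
            for s in range(h, max_sum + 1):
                count[g][s] += count[g - 1][s - h]
    total = comb(n, k)
    return {1 - Fraction(s, n * k) + Fraction(1, 2 * n): Fraction(c, total)
            for s, c in enumerate(count[k]) if c}

example = pmf_alpha_one_to_one(3, 2)
```

For $n = 3$ and $k = 2$, the three possible position pairs have sums 3, 4, and 5, so the three resulting $\alpha$ values are equally likely.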
Approximation: Furthermore, we can approximate the PMF by ignoring the 
distinct-number constraint, i.e., “counting the number of partitions of $s$ into 
$k$ summands from $\{1, 2, \ldots, n\}$”. This problem has a nice generating function 
$(x + x^2 + \ldots + x^n)^k$, where the coefficient of $x^s$ is the number of partitions [43]: 

$$[x^s]\,(x + x^2 + \ldots + x^n)^k = \sum_{i=0}^{\lfloor (s-k)/n \rfloor} (-1)^i \binom{k}{i} \binom{s - ni - 1}{k - 1} \qquad (6)$$

We can calculate these coefficients using two algorithms with different tradeoffs. 
The first algorithm first pre-calculates the binomial coefficients with Pascal’s 
triangle and then calculates all the coefficients with formula (6). The first step 
takes $O(nk^2)$ because $s - ni - 1 < nk$ and $i \le k$. The second step takes $O(nk^2)$ 
because each of $O(nk)$ coefficients takes $O(k)$ to compute as $\lfloor \frac{s-k}{n} \rfloor < k$. Thus, the 
overall time complexity of the first algorithm is $O(nk^2)$. The second algorithm 
calculates the generating function directly with the fast Fourier transform [4] by 
first converting $x + x^2 + \ldots + x^n$ to the point-value representation, calculating 
each point value to the $k$-th power, and interpolating to get the coefficients. The 
second algorithm takes $O(nk \log(nk))$ because the length of the polynomial is 
$O(nk)$. Comparing the complexity, the first algorithm is better when $k$ is small 
compared to $n$ (i.e., $k - \log k < \log n$), and the second is better otherwise. 
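
The coefficients of $(x + x^2 + \ldots + x^n)^k$ can be cross-checked with a short Python sketch. Both function names are hypothetical. The first function uses the standard inclusion-exclusion count of ways to write $s$ as an ordered sum of $k$ summands from $\{1, \ldots, n\}$, which is assumed here to be the form of formula (6); the second multiplies the polynomial out naively and stands in for the FFT-based algorithm, which is not reproduced.

```python
from math import comb

def coeffs_by_formula(n, k):
    """Coefficient of x^s in (x + ... + x^n)^k for s = k..nk, by inclusion-exclusion."""
    return {s: sum((-1) ** i * comb(k, i) * comb(s - n * i - 1, k - 1)
                   for i in range((s - k) // n + 1))
            for s in range(k, n * k + 1)}

def coeffs_by_multiplication(n, k):
    """Same coefficients by naive repeated polynomial multiplication (no FFT)."""
    base = [0] + [1] * n                  # x + x^2 + ... + x^n
    poly = [1]
    for _ in range(k):
        nxt = [0] * (len(poly) + n)
        for a, ca in enumerate(poly):
            for b, cb in enumerate(base):
                nxt[a + b] += ca * cb
        poly = nxt
    return {s: c for s, c in enumerate(poly) if c}

small = coeffs_by_formula(2, 2)           # (x + x^2)^2 = x^2 + 2x^3 + x^4
```

Dividing each coefficient by $n^k$ (the sum of all coefficients) turns the counts into a probability distribution over $s$.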

To evaluate the approximation, we use Jensen-Shannon (JS) distance [16] 
between the exact and the approximated PMFs. We check our approximation 
on the same real scenarios as in Section 3.1. As Table 1 shows, the approximation 
yields PMFs with a small JS distance, the largest only 0.0442 for $n = 52$, $k = 9$. 


3.3 PMF of $\gamma$ 


The PMF of $\gamma$ is more complex than that of $\alpha$ because even for the simplest 
all-to-one failure-to-fault matrix, the number of possible values of $\gamma$ can be $\Omega(2^n)$. 
For example, consider $n$ tests with costs $1, 2, 4, \ldots, 2^{n-1}$, and only one test fails 
and detects the only fault. The $\gamma$ value depends on the sum of the costs of the 
tests that precede the failure. $2^{n-1}$ different sets of the tests can precede the 
failure, and every set has a distinct sum of the costs. Even for the example in 
Fig. 1, the support of PMF for $\gamma$ (33) is much bigger than that for $\alpha$ (8). 


4 Expected Values for All Orders $\Omega_a(T)$ 


While some comparisons of RTP techniques use full samples of PMFs, many use 
just the arithmetic mean of the samples. We next derive formulas for expected 
values to obtain the mean faster and without the imprecision from sampling. 

In this section, we consider the case where order $o$ is uniformly selected from 
$\Omega_a(T)$, allowing $n!$ orders of $n$ tests. Because $\alpha$ is a special case of $\gamma$ where 
$\forall t, t' \in T. \sigma(t) = \sigma(t')$, we first derive $\gamma$. 
To start with a simple example, consider a test suite with only one failing 
test ($k = 1$). For a random order, the test can be at any position with equal 
probability. Intuitively, the expected position across all of the orders is at the 
middle of the sequence, hence $\alpha$ and $\gamma$ should be about 1/2. In fact, we will show 
that they are exactly 1/2. Moreover, the expected values of both $\alpha$ and $\gamma$ are 
1/2 as long as each fault is detected by only one failing test ($\forall i. k_i = |T_i| = 1$, 
which includes one-to-one). In general, the failure-to-fault matrix can be more 
complex: many tests could detect the same fault, and a test could detect many 
faults. To compute the expected values of $\alpha$ and $\gamma$, we first prove a useful lemma. 

Lemma 4. For every fault $i$, 

$$\forall t \notin T_i.\ P_a(t < t_{\tau_i}) = P_a(\forall t' \in T_i.\ t < t') = \frac{1}{k_i + 1} \qquad (7)$$


Proof. Since $\tau_i$ is the position of the first test from $T_i$ in the order, $t$ precedes 
$t_{\tau_i}$ iff $t$ precedes every $t' \in T_i$. Consider the relative position of each $t \notin T_i$ with 
respect to all the tests from $T_i$ in a random order. By symmetry, it is equally 
likely that $t$ is in any of the $k_i + 1$ relative positions created by the relative order 
of the $k_i$ tests from $T_i$. Therefore, the probability that $t$ is in the relative position 
preceding all the $k_i$ tests from $T_i$ is $\frac{1}{k_i + 1}$. 


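We first use this lemma to compute $E_a[\gamma]$. 

Theorem 5 (The expected value of $\gamma$ for $\Omega_a(T)$). 

$$E_a[\gamma] = 1 - \frac{\sum_{i=1}^{m} \left( \frac{\sigma(T \setminus T_i)}{k_i + 1} + \frac{\sigma(T_i)}{2 k_i} \right)}{m \cdot \sigma(T)} \qquad (8)$$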


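Proof. From (2), the two key terms in $\gamma$ are $\sigma(t_{\tau_i})$ and $\sum_{j=\tau_i}^{n} \sigma(t_j)$. By 
symmetry, any test $t \in T_i$ can be the first in the order, or equivalently $t = t_{\tau_i}$, with 
probability $\frac{1}{k_i}$. Thus 

$$E_a[\sigma(t_{\tau_i})] = \sum_{t \in T_i} P(t = t_{\tau_i})\,\sigma(t) = \frac{\sigma(T_i)}{k_i} \qquad (9)$$

Next, consider that $\sum_{j=\tau_i}^{n} \sigma(t_j) = \sum_{t \in T} \sigma(t) 1_{t_{\tau_i} \le t}$ can be also calculated as 
$\sum_{t \in T_i} \sigma(t) 1_{t_{\tau_i} \le t} + \sum_{t \notin T_i} \sigma(t) 1_{t_{\tau_i} \le t}$. For every test $t \in T_i$, $t_{\tau_i} \le t$ by definition, 
so $\forall t \in T_i.\ E_a[1_{t_{\tau_i} \le t}] = 1$. For every test $t \notin T_i$, $E_a[1_{t_{\tau_i} \le t}] = P_a(t_{\tau_i} < t) = 
1 - P_a(t < t_{\tau_i}) = \frac{k_i}{k_i + 1}$. The last equality stems from Lemma 4. Therefore, by 
the linearity of expectation, we get 

$$E_a\Big[\sum_{j=\tau_i}^{n} \sigma(t_j)\Big] = \sigma(T_i) + \frac{k_i}{k_i + 1}\,\sigma(T \setminus T_i) \qquad (10)$$

From (2), (9), and (10), we get (8). 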
Corollary 5.1 (The expected value of $\alpha$ for $\Omega_a(T)$). 

$$E_a[\alpha] = 1 - \frac{(n+1) \sum_{i=1}^{m} \frac{1}{k_i + 1}}{nm} + \frac{1}{2n} \qquad (11)$$
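
Formulas (8) and (11) translate directly into code. In the Python sketch below, the function names, the `detects` list of sets $T_i$, and the integer-valued `cost` dictionary for $\sigma$ are assumptions of this example; exact rational arithmetic avoids rounding.

```python
from fractions import Fraction

def expected_gamma_all(detects, cost):
    """E_a[gamma] by formula (8); detects[i] = T_i, cost[t] = sigma(t) (integers)."""
    total = sum(cost.values())                      # sigma(T)
    acc = Fraction(0)
    for T_i in detects:
        k_i = len(T_i)
        in_cost = sum(cost[t] for t in T_i)         # sigma(T_i)
        acc += Fraction(total - in_cost, k_i + 1) + Fraction(in_cost, 2 * k_i)
    return 1 - acc / (len(detects) * total)

def expected_alpha_all(n, detects):
    """E_a[alpha] by formula (11)."""
    m = len(detects)
    inv_sum = sum(Fraction(1, len(T_i) + 1) for T_i in detects)
    return 1 - Fraction(n + 1, n * m) * inv_sum + Fraction(1, 2 * n)

# Four tests, fault 0 detected by one test and fault 1 detected by two tests.
detects = [{0}, {1, 2}]
mean_alpha = expected_alpha_all(4, detects)
mean_gamma = expected_gamma_all(detects, {t: 1 for t in range(4)})
```

With equal costs the two functions agree, and a scenario in which every fault has exactly one failing test yields exactly 1/2 regardless of the costs.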


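Revisiting the case where each fault can be detected by only one failing test, 
setting $\forall i. k_i = 1$ in (8) or (11), gives exactly $1/2 = E_a[\alpha] = E_a[\gamma]$. In fact, even 
in the general case of any failure-to-fault matrix, we find that the two expected 
values are similar if not the same, inspiring us to derive the following bound: 

Theorem 6 (The expected difference of $\alpha$ and $\gamma$ for $\Omega_a(T)$). 

$$-\frac{1}{12} < E_a[\alpha] - E_a[\gamma] < \frac{1}{2n} \qquad (12)$$

Proof. From formulas (8) and (11), we have $E_a[\alpha] - E_a[\gamma] = \Delta_\gamma - \Delta_\alpha + \frac{1}{2n}$, 
where $\Delta_\gamma = \frac{\sum_{i=1}^{m} \left( \frac{1}{2k_i} - \frac{1}{k_i+1} \right) \sigma(T_i)}{m \cdot \sigma(T)}$ and $\Delta_\alpha = \frac{\sum_{i=1}^{m} \frac{1}{k_i+1}}{nm}$. Since $k_i \ge 1$, we have 
$-\frac{1}{12} \le \frac{1}{2k_i} - \frac{1}{k_i+1} \le 0$ (with basic calculus, minimum is for $k_i = 2$ or $k_i = 3$), 
which, combined with $\sigma(T_i) \le \sigma(T)$, gives $-\frac{1}{12} \le \Delta_\gamma \le 0$. Since $k_i \ge 1$, we 
also have $0 < \frac{1}{k_i+1} \le \frac{1}{2}$, which gives $0 < \Delta_\alpha \le \frac{1}{2n}$. Thus, we have $-\frac{1}{12} \le 
\Delta_\gamma - \Delta_\alpha + \frac{1}{2n} < \frac{1}{2n}$. However, $\Delta_\gamma - \Delta_\alpha + \frac{1}{2n} = -\frac{1}{12}$ would require $\Delta_\alpha = \frac{1}{2n}$ and 
hence $\forall i. k_i = 1$, in which case $\Delta_\gamma = 0$ and $\Delta_\gamma - \Delta_\alpha + \frac{1}{2n} = 0 \neq -\frac{1}{12}$. Therefore, 
the equality cannot hold and $-\frac{1}{12} < E_a[\alpha] - E_a[\gamma] < \frac{1}{2n}$. 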


5 Expected Values for Compatible Orders $\Omega_c(T)$ 


In this section, we consider the expected values of $\alpha$ and $\gamma$ for $\Omega_c(T)$. Compatible 
orders do not interleave tests from different classes, as defined in Section 2. 
Similar to $\Omega_a(T)$, we first prove a useful lemma for $\Omega_c(T)$. 


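Lemma 7. For every fault $i$, (note that if $t \notin T_i$, $C(t)$ may have another $t' \in T_i$) 

$$\forall t \notin T_i.\ P_c(t < t_{\tau_i}) = P_c(\forall t' \in T_i.\ t < t') = \begin{cases} \frac{1}{|\mathcal{C}_i| \, (|T_{i,C(t)}| + 1)} & C(t) \in \mathcal{C}_i \\ \frac{1}{|\mathcal{C}_i| + 1} & C(t) \notin \mathcal{C}_i \end{cases} \qquad (13)$$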


Proof. For the $C(t) \in \mathcal{C}_i$ case, two conditions must hold for $t \notin T_{i,C(t)}$ to precede 
all tests that detect the fault $i$. First, among all classes in $\mathcal{C}_i$, $C(t)$ must be the 
first in the order, and by symmetry, each class in $\mathcal{C}_i$ can be the first with the 
same probability $\frac{1}{|\mathcal{C}_i|}$. Second, $t$ must precede all tests from $T_{i,C(t)}$, which 
(similar to Lemma 4) holds with the probability $\frac{1}{|T_{i,C(t)}| + 1}$. The two conditions are 
independent because they are about the class order and the test order inside the 
class, respectively, and these orders are independent of each other. Therefore, the 
probability that $t$ precedes the first test that detects the fault $i$ is $\frac{1}{|\mathcal{C}_i| \, (|T_{i,C(t)}| + 1)}$. 

For the $C(t) \notin \mathcal{C}_i$ case, only one condition ($C(t)$ precedes all classes in $\mathcal{C}_i$) 
must hold for $t$ to precede the first test that detects the fault $i$, which (similar 
to Lemma 4) happens with probability $\frac{1}{|\mathcal{C}_i| + 1}$. 


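Theorem 8 (The expected value of $\gamma$ for $\Omega_c(T)$). 

$$E_c[\gamma] = 1 - \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} \sigma(T_C)}{|\mathcal{C}_i| + 1} + \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \left( \frac{\sigma(T_C \setminus T_{i,C})}{|T_{i,C}| + 1} + \frac{\sigma(T_{i,C})}{2 |T_{i,C}|} \right) \right) \qquad (14)$$

Proof. We first compute the two key terms $\sigma(t_{\tau_i})$ and $\sum_{j=\tau_i}^{n} \sigma(t_j)$ in $\gamma$. For each 
test $t \in T_i$ to be the first, its class $C(t) \in \mathcal{C}_i$ should be the first among all classes 
in $\mathcal{C}_i$ with probability $\frac{1}{|\mathcal{C}_i|}$, and $t$ must be the first among all tests in $T_{i,C(t)}$ with 
probability $\frac{1}{|T_{i,C(t)}|}$. These two events are independent, so the joint probability 
is $\frac{1}{|\mathcal{C}_i| \, |T_{i,C(t)}|}$. By $\sigma(t_{\tau_i}) = \sum_{t \in T_i} \sigma(t) \cdot 1_{t = t_{\tau_i}}$, we have 

$$E_c[\sigma(t_{\tau_i})] = \sum_{t \in T_i} \frac{\sigma(t)}{|\mathcal{C}_i| \, |T_{i,C(t)}|} = \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \frac{\sigma(T_{i,C})}{|T_{i,C}|} \qquad (15)$$

Next, consider $\sum_{j=\tau_i}^{n} \sigma(t_j) = \sum_{t \in T} \sigma(t) \cdot 1_{t_{\tau_i} \le t}$. Each $t$ is either (1) $t \in T_i$, where 
$1_{t_{\tau_i} \le t} = 1$ by definition of $\tau_i$; or (2) $t \notin T_i$, where $E_c[1_{t_{\tau_i} \le t}] = E_c[1_{t_{\tau_i} < t}] = 
P_c(t_{\tau_i} < t) = 1 - P_c(t < t_{\tau_i})$ can be obtained from Lemma 7. Combining these 
cases, we have 

$$E_c\Big[\sum_{j=\tau_i}^{n} \sigma(t_j)\Big] = \sigma(T_i) + \frac{|\mathcal{C}_i|}{|\mathcal{C}_i| + 1} \sum_{C \notin \mathcal{C}_i} \sigma(T_C) + \sum_{C \in \mathcal{C}_i} \left( 1 - \frac{1}{|\mathcal{C}_i| \, (|T_{i,C}| + 1)} \right) \sigma(T_C \setminus T_{i,C}) \qquad (16)$$

From (2), (15), and (16), we get (14). 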
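
Formula (14) can be evaluated mechanically, as in the following Python sketch. The function name `expected_gamma_compatible`, the `cls` dictionary mapping each test to its class, and the integer-valued `cost` dictionary are assumptions of this example.

```python
from fractions import Fraction

def expected_gamma_compatible(detects, cost, cls):
    """E_c[gamma] by formula (14).

    detects -- detects[i] is the set T_i of tests that detect fault i
    cost    -- cost[t] is sigma(t) (integers), for every test t
    cls     -- cls[t] is the class C(t) of test t
    """
    classes = {}                                    # class -> T_C
    for t, c in cls.items():
        classes.setdefault(c, set()).add(t)

    def sigma(tests):
        return sum(cost[t] for t in tests)

    total = sigma(cost)                             # sigma(T)
    acc = Fraction(0)
    for T_i in detects:
        # hit[C] = T_{i,C} for every class C in C_i
        hit = {c: ts & T_i for c, ts in classes.items() if ts & T_i}
        miss_cost = sum(sigma(ts) for c, ts in classes.items() if c not in hit)
        inner = sum(Fraction(sigma(classes[c] - t_ic), len(t_ic) + 1)
                    + Fraction(sigma(t_ic), 2 * len(t_ic))
                    for c, t_ic in hit.items())
        acc += Fraction(miss_cost, len(hit) + 1) + inner / len(hit)
    return 1 - acc / (len(detects) * total)

# Two classes of two tests each; the fault is detected by one test per class.
cls = {0: "A", 1: "A", 2: "B", 3: "B"}
value = expected_gamma_compatible([{1, 2}], {t: 1 for t in range(4)}, cls)
```

In this example the first failing test is at position 1 or 2 with equal probability in every class order, so with equal costs the expected value is $1 - \frac{1.5}{4} + \frac{1}{8} = \frac{3}{4}$.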


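Corollary 8.1 (The expected value of $\alpha$ for $\Omega_c(T)$). 

$$E_c[\alpha] = 1 - \frac{1}{nm} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} |T_C|}{|\mathcal{C}_i| + 1} + \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \frac{|T_C| + 1}{|T_{i,C}| + 1} \right) + \frac{1}{2n} \qquad (17)$$

We next discuss the expected difference of $E_c[\alpha]$ and $E_c[\gamma]$. Unlike the case 
with $\Omega_a(T)$, where the difference has a rather small bound, we find that the 
difference can be rather large for $\Omega_c(T)$. 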

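Theorem 9 (The expected difference of $\alpha$ and $\gamma$ for $\Omega_c(T)$). 

$$-\frac{1}{2} < E_c[\alpha] - E_c[\gamma] \le \frac{1}{2} - \frac{1}{2n} \qquad (18)$$

Proof. From (14) and (17), we get $E_c[\alpha] - E_c[\gamma] = \Delta_\gamma - \Delta_\alpha + \frac{1}{2n}$, where 

$$\Delta_\gamma = \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} \sigma(T_C)}{|\mathcal{C}_i| + 1} + \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \left( \frac{\sigma(T_C \setminus T_{i,C})}{|T_{i,C}| + 1} + \frac{\sigma(T_{i,C})}{2 |T_{i,C}|} \right) \right)$$

and 

$$\Delta_\alpha = \frac{1}{nm} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} |T_C|}{|\mathcal{C}_i| + 1} + \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \frac{|T_C| + 1}{|T_{i,C}| + 1} \right)$$

$\Delta_\gamma > 0$ because all the terms in $\Delta_\gamma$ are positive. From $\forall i, C \in \mathcal{C}_i.\ |\mathcal{C}_i| \ge 1, |T_{i,C}| \ge 1$, we have 

$$\Delta_\gamma \le \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} \sigma(T_C)}{2} + \sum_{C \in \mathcal{C}_i} \left( \frac{\sigma(T_C \setminus T_{i,C})}{2} + \frac{\sigma(T_{i,C})}{2} \right) \right) = \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \frac{\sigma(T)}{2} = \frac{1}{2}$$

Similarly, 

$$\Delta_\alpha \le \frac{1}{nm} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} |T_C|}{2} + \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \frac{|T_C| + 1}{2} \right) = \frac{1}{nm} \sum_{i=1}^{m} \left( \frac{\sum_{C \notin \mathcal{C}_i} |T_C|}{2} + \frac{\sum_{C \in \mathcal{C}_i} |T_C|}{2 |\mathcal{C}_i|} + \frac{1}{2} \right) \le \frac{1}{nm} \sum_{i=1}^{m} \left( \frac{n}{2} + \frac{1}{2} \right) = \frac{n+1}{2n}$$

From $0 < |T_{i,C}| \le |T_C|$, we also have $\Delta_\alpha \ge \frac{1}{n}$. Combining $0 < \Delta_\gamma \le \frac{1}{2}$ and 
$\frac{1}{n} \le \Delta_\alpha \le \frac{n+1}{2n}$, we get $-\frac{1}{2} < \Delta_\gamma - \Delta_\alpha + \frac{1}{2n} \le \frac{1}{2} - \frac{1}{2n}$. 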


Considering many inequalities in the preceding proof, one may expect the 
bounds to be loose, but we show two scenarios where bounds are close to tight. 
Both scenarios have only one fault. Scenario one has two classes: C1 has only 
one passing test t with cost qN (q > 0 is arbitrary), and C5 has N failing 
tests each with cost 4. We assume N > 1. t must be the first or last in any 
compatible order, each with probability 1/2 (when C; is first or second). E,[a] is 
close to 1, and E [5] is only about 1/2. Precisely, E.[a] — E.[7] = Nd 2: 2 
when N > 1. Scenario two has two classes: Ca has N failing tests with cost * 
and C3 has N? passing tests each with cost qs. The two classes have only two 
orders, each with probability 1/2. Ec[5] is clase to 1, and E,[a] is only about 


1/2. Precisely, Ec[o] — Ec[y] = NHI wa + shy © —4 when N >1. 


5.1 Comparison of $\Omega_a(T)$ and $\Omega_c(T)$ 


Orders that are compatible have more constraints on the PMF, which could 
increase or decrease average $\alpha$ or $\gamma$ values. To compare how orders in $\Omega_a(T)$ and 
$\Omega_c(T)$ perform on average, we compare $E_a[\alpha]$ with $E_c[\alpha]$ and $E_a[\gamma]$ with $E_c[\gamma]$. 


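Theorem 10 (Difference of $E_c[\gamma]$ and $E_a[\gamma]$). 

$$-\frac{1}{2} < E_c[\gamma] - E_a[\gamma] \le \frac{1}{6} \qquad (19)$$

Proof. From (8) and (14), we have 

$$E_c[\gamma] - E_a[\gamma] = \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( \frac{\sigma(T_i)}{2k_i} + \frac{\sigma(T \setminus T_i)}{k_i + 1} - \frac{\sum_{C \notin \mathcal{C}_i} \sigma(T_C)}{|\mathcal{C}_i| + 1} - \frac{1}{|\mathcal{C}_i|} \sum_{C \in \mathcal{C}_i} \left( \frac{\sigma(T_C \setminus T_{i,C})}{|T_{i,C}| + 1} + \frac{\sigma(T_{i,C})}{2 |T_{i,C}|} \right) \right) \qquad (20)$$

Because $\forall i.\ 1 \le k_i \le n, |\mathcal{C}_i| \ge 1, |T_{i,C}| \ge 1$, we have 

$$E_c[\gamma] - E_a[\gamma] > \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( 0 - \frac{\sigma(T)}{2} \right) = -\frac{1}{2}$$

For the other side, because $\forall i.\ |\mathcal{C}_i| \le k_i, |T_{i,C}| \le k_i$, we have 

$$\begin{aligned} E_c[\gamma] - E_a[\gamma] &\le \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( \frac{\sigma(T_i)}{2k_i} + \frac{\sigma(T \setminus T_i)}{k_i + 1} - \frac{\sum_{C \notin \mathcal{C}_i} \sigma(T_C)}{k_i + 1} - \frac{(\sum_{C \in \mathcal{C}_i} \sigma(T_C)) - \sigma(T_i)}{|\mathcal{C}_i| (k_i + 1)} - \frac{\sigma(T_i)}{2 |\mathcal{C}_i| k_i} \right) \\ &= \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( 1 - \frac{1}{|\mathcal{C}_i|} \right) \left( \frac{\sum_{C \in \mathcal{C}_i} \sigma(T_C)}{k_i + 1} - \left( \frac{1}{k_i + 1} - \frac{1}{2k_i} \right) \sigma(T_i) \right) \\ &\le \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \left( 1 - \frac{1}{|\mathcal{C}_i|} \right) \frac{\sum_{C \in \mathcal{C}_i} \sigma(T_C)}{k_i + 1} \\ &\le \frac{1}{m \cdot \sigma(T)} \sum_{i=1}^{m} \frac{|\mathcal{C}_i| - 1}{|\mathcal{C}_i| (|\mathcal{C}_i| + 1)} \sum_{C \in \mathcal{C}_i} \sigma(T_C) \\ &\le \frac{1}{6} \end{aligned}$$

The third last inequality holds because $\forall k_i \ge 1.\ \frac{1}{k_i + 1} - \frac{1}{2k_i} \ge 0$. The last 
inequality holds because $\forall |\mathcal{C}_i| \ge 1.\ \frac{|\mathcal{C}_i| - 1}{|\mathcal{C}_i| (|\mathcal{C}_i| + 1)} \le \frac{1}{6}$, which can be shown with simple 
calculus, and $\sum_{C \in \mathcal{C}_i} \sigma(T_C) \le \sigma(T)$. 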


Corollary 10.1 (Difference of $E_c[\alpha]$ and $E_a[\alpha]$). 

$$-\frac{1}{2} < E_c[\alpha] - E_a[\alpha] \le \frac{1}{6} \qquad (21)$$


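We give two scenarios where the preceding bounds are close to tight. In both 
scenarios, we set $\forall t, t' \in T. \sigma(t) = \sigma(t')$, so that $\alpha = \gamma$ and $E_c[\alpha] - E_a[\alpha] = 
E_c[\gamma] - E_a[\gamma]$. The first scenario has one fault $F$, each of the $|\mathcal{C}|$ classes contains 
$\frac{n}{|\mathcal{C}|}$ tests, and tests from only one class detect $F$ but all tests in that class detect 
$F$. In this scenario, $E_a[\alpha] = 1 - \frac{(n+1)|\mathcal{C}|}{n(n + |\mathcal{C}|)} + \frac{1}{2n}$, and $E_c[\alpha] = 1 - \frac{|\mathcal{C}| - 1}{2|\mathcal{C}|} - \frac{1}{2n}$. 
If we consider $|\mathcal{C}| = \sqrt{n}$, when $n \gg 1$, $E_a[\alpha] \approx 1$ but $E_c[\alpha] \approx 1/2$, hence 
$E_c[\alpha] - E_a[\alpha] \approx -1/2$. The second scenario has one fault $F$ and two classes with 
1 and $n - 1$ tests, and each class contains only one test that detects $F$. In this 
scenario, $E_a[\alpha] = \frac{2}{3} + \frac{1}{6n}$ and $E_c[\alpha] = \frac{3}{4}$. When $n \gg 1$, $E_c[\alpha] - E_a[\alpha] \approx \frac{1}{12}$, close 
to the upper bound of 1/6. 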

In brief, measured by $\alpha$ or $\gamma$, compatible orders can be much worse on average 
than all orders (up to 1/2) but cannot be much better (up to 1/6). 


6 Properties of Metrics and Checking Prior RTP Work 


Prior work on random RTP uses sampling and often visualizes $\alpha$ and $\gamma$ values 
as boxplots that may show the median, mean, quartiles (25% and 75%), and 
“whiskers” (1.5 times the interquartile range) of the sampled distribution. For 
papers that show these boxplots, we identify two properties for the boxplots, 
focusing on $\Omega_a(T)$ because it is used in almost all prior work instead of $\Omega_c(T)$ [46]: 


- Mean/Median at Least Half: $E_a[\alpha], \text{Med}_a(\alpha), E_a[\gamma], \text{Med}_a(\gamma) \ge 1/2$. 
- Symmetric PMF: $E_a[\alpha] = 1/2 \Leftrightarrow \text{Med}_a(\alpha) = 1/2 \Leftrightarrow E_a[\gamma] = 1/2 \Leftrightarrow 
  \text{Med}_a(\gamma) = 1/2 \Leftrightarrow \forall i. k_i = 1 \Rightarrow$ PMFs of $\alpha$ and $\gamma$ are symmetric around 1/2. 


To check the boxplots from prior work, we search on Google Scholar for 
papers related to “test prioritization” and keep only the papers that contain 
both “test” and “prioriti” in the titles. We sort these papers based on their 
citation count and check the top 100 papers with the highest citation count [44]. 

6.1 Mean/Median at Least Half 


Lemma 11. $\forall o \in \Omega_a(T)$ and its reverse order $\bar{o} \in \Omega_a(T)$, 

$$\gamma(o) + \gamma(\bar{o}) \ge 1 \qquad (22)$$

The equality holds iff $\forall i. k_i = 1$. 

Proof sketch. To give some intuition, when $\forall i. k_i = 1$, the test that detects the 
fault $i$ first does not change by reversing the order, so the “prefixes” of the test 
in $o$ and $\bar{o}$ complement each other and form the entire test suite. In this case, 
$\gamma(o) + \gamma(\bar{o}) = 1$. If $\exists i. k_i \ge 2$, the test that detects the fault $i$ first in $o$ is not the 
same test in $\bar{o}$, and the “prefixes” of these two tests in $o$ and $\bar{o}$ do not form the 
entire test suite, so $\gamma(o) + \gamma(\bar{o}) > 1$. We omit the details due to space limit. 
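
Lemma 11 can be checked exhaustively on small scenarios. The Python sketch below is only an illustration: `apfd_c` restates formula (2), and `reversal_sums` (a hypothetical helper) collects $\gamma(o) + \gamma(\bar{o})$ over all orders.

```python
from fractions import Fraction
from itertools import permutations

def apfd_c(order, detects, cost):
    """APFD_c (gamma) of `order`, following formula (2)."""
    total_cost = sum(cost[t] for t in order)
    acc = Fraction(0)
    for T_i in detects:
        first = min(order.index(t) for t in T_i)     # 0-based index of tau_i
        acc += sum(cost[t] for t in order[first:]) - Fraction(cost[order[first]]) / 2
    return acc / (len(detects) * total_cost)

def reversal_sums(tests, detects, cost):
    """Set of values gamma(o) + gamma(reverse of o) over all orders o of `tests`."""
    return {apfd_c(o, detects, cost) + apfd_c(o[::-1], detects, cost)
            for o in permutations(tests)}

cost = {0: 3, 1: 1, 2: 4, 3: 2}
# Every fault has exactly one failing test: every sum equals 1.
single = reversal_sums(range(4), [{0}, {1}], cost)
# One fault has two failing tests: every sum is strictly greater than 1.
multiple = reversal_sums(range(4), [{0, 1}, {2}], cost)
```

The two calls illustrate the equality case ($\forall i. k_i = 1$) and the strict case ($\exists i. k_i \ge 2$) of the lemma for one arbitrary choice of costs.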


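Theorem 12 (Measures of central tendency are at least half). 

$$\min\{E_a[\alpha], \text{Med}_a(\alpha), E_a[\gamma], \text{Med}_a(\gamma)\} \ge 1/2 \qquad (23)$$

The equality holds iff $\forall i. k_i = 1$. 


Proof sketch. From (22), we get $E_a[\gamma] = \frac{1}{2} \cdot \frac{\sum_{o \in \Omega_a(T)} (\gamma(o) + \gamma(\bar{o}))}{|\Omega_a(T)|} \ge \frac{1}{2}$ and the 
equality holds iff $\forall i. k_i = 1$. Because $\alpha$ can be viewed as a special case of $\gamma$, we 
also have the same result for $E_a[\alpha]$. The same result for $\text{Med}_a(\alpha)$ and $\text{Med}_a(\gamma)$ 
can also be derived from (22). We omit the details due to space limit. 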


When we inspect the top 100 most cited RTP papers, we find at least five 
papers with boxplots clearly showing a mean or median below 1/2. These papers 
range from seminal papers [12, Figs. 2b, 2c, 2e] (year 2000) and [13, Fig. 3: 
schedule, tcas] (2002) to more recent [29, Fig. 4] (2007), [5, Fig. 2] (2016 — a co- 
author of this prior paper is also in this paper), and [41, Fig. 5] (2017). Instead of 
sampling random orders for an arbitrary number of times, future RTP research 
could use our formulas or algorithm to obtain correct mean and median values. 


6.2 Symmetric PMF 


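We also prove that $\alpha$ and $\gamma$ PMFs are symmetric when (23)’s equality holds. 


Theorem 13 (Symmetry of the $\alpha$ and $\gamma$ PMFs). If $E_a[\alpha] = 1/2 \vee \text{Med}_a(\alpha) = 
1/2 \vee E_a[\gamma] = 1/2 \vee \text{Med}_a(\gamma) = 1/2 \vee \forall i. k_i = 1$, then 

$$\forall \delta.\ P(\alpha = 1/2 - \delta) = P(\alpha = 1/2 + \delta) \wedge P(\gamma = 1/2 - \delta) = P(\gamma = 1/2 + \delta) \qquad (24)$$

Proof. From Theorem 12, $\min\{E_a[\alpha], \text{Med}_a(\alpha), E_a[\gamma], \text{Med}_a(\gamma)\} = 1/2 \vee \forall i. k_i = 
1 \Leftrightarrow \forall i. k_i = 1 \Rightarrow \forall o.\ \alpha(o) + \alpha(\bar{o}) = 1 \wedge \gamma(o) + \gamma(\bar{o}) = 1$. Each order has exactly 
one reverse order, so the PMFs of $\alpha$ and $\gamma$ are symmetric around 1/2. 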


When we inspect the top 100 most cited RTP papers again, we find at least 
three papers relevant to this property. Based on the information in these 
papers, we believe that $\forall i. k_i = 1$ is true. Ideally, we would confirm each paper's 
failure-to-fault matrix, but papers often omit such details. On a positive note, 
the authors of one paper [38] released their dataset, which we analyze and 
confirm that $\forall i. k_i = 1$. The papers that violate this property include the most 
widely cited paper on RTP [38, Fig. 5: schedule, schedule2, tcas] (year 2001; 
1563 citations per Google Scholar) and others, both older [36, Fig. 4: schedule, 
schedule2, tcas] (1999) and newer [40, Fig. 2] (2015) papers. 

Instead of randomly sampling orders to approximate PMFs, future RTP 
papers could use our algorithm to compute exact PMFs. While we find only five 
and three papers that definitely violate Mean/Median at Least Half and 
Symmetric PMF, respectively, we suspect that many others may violate these or 
similar properties. However, due to the lack of data in many papers (e.g., no 
boxplot for random RTP), we cannot easily identify all violations. 


7 Related Work 


Some prior work [45,49] considers expected values of $\alpha$ and $\gamma$ but in different 
contexts from ours. Random testing (but not random RTP) has been studied for 
a while [7-9,17,18,31-33,50]. The most related are theoretical analyses of random 
test generation. Böhme and Paul [2,3] analyze how random sampling of test 
inputs compares to systematic generation: random can be more efficient when 
the cost to systematically generate a test input exceeds the cost to randomly 
sample an input by some factor. Böhme et al. [1] analyze the connection between 
Shannon's entropy and the discovery rate of a fuzzer that randomly generates 
inputs. They provide the foundation for identifying random seeds for the fuzzer 
to improve the overall efficiency. Their analysis also enables future systematic 
approaches for test generation to be more efficiently compared with random. 
Similarly, our analysis can help future RTP work more efficiently compare against 
random RTP and avoid insufficient sampling. Beyond random test generation, 
Majumdar and Niksic [26] present a theoretical analysis on the effectiveness 
of randomly inserted partition faults to find bugs in distributed systems. In 
contrast, our analysis is on test-suite orders for random RTP. 


8 Conclusion 


Regression test prioritization (RTP) is a popular regression testing approach. 
Majority of highly cited RTP papers have compared RTP techniques with ran- 
dom RTP. However, all evaluations have been empirical, with no prior theoretical 
analysis of random RTP. This paper has presented such analysis, by introduc- 
ing an algorithm for efficiently computing the exact probability mass function 
of APFD, deriving closed-form formulas and approximations for various metrics 
and scenarios, and deriving two interesting properties for APFD and APFDc.
Overall, our analysis provides new insights into the random RTP, and our re- 
sults show that future RTP work often need not use random sampling but can 
use our simple formulas or algorithms to more precisely evaluate random RTP. 
Abstract. First-order temporal logics and rule-based formalisms are two popular
families of specification languages for monitoring. Each family has its advantages,
and only a few monitoring tools support their combination. We extend metric
first-order temporal logic (MFOTL) with a recursive let construct, which enables 
interleaving rules with temporal logic formulas. We also extend VeriMon, an 
MFOTL monitor whose correctness has been formally verified using the Isabelle 
proof assistant, to support the new construct. The extended correctness proof 
covers the interaction of the new construct with the existing verified algorithm, 
which is subtle due to the presence of the bounded future temporal operators. We 
demonstrate the recursive let’s usefulness on several example specifications and 
evaluate our verified algorithm’s performance against the DejaVu monitoring tool. 


Keywords: Rule-based specifications · Monitoring · Formal verification.


1 Introduction 


In runtime verification, a monitor observes events generated by a running system and 
analyzes the event streams for compliance with a given specification. Temporal spec- 
ification languages for monitoring are often classified as operational or declarative [10]. 
Operational languages explicitly describe how the monitor's input should be transformed 
to obtain an output. Two important subclasses of operational languages are rule-based for- 
malisms [2,13] and stream runtime verification (SRV) languages [6,8,11,20]. Both formu- 
late the transformations as recursive equations. In contrast, declarative languages, such as 
first-order temporal logics [4, 15], describe the output by composing high-level operators. 

Operational and declarative languages have complementary advantages: declarative 
languages let specification authors focus on the “what” and not the “how”, whereas 
operational languages offer the authors more control over the evaluation. Most runtime 
verification tools do not support mixing the paradigms, especially when it comes to 
parametric, i.e., first-order, specification languages. A notable exception is the recent 
addition of recursive rules to past-time first-order temporal logic (PFLTL), implemented 
in the DejaVu monitoring tool [14]. As another important benefit, recursive rules can 
express operations like transitive closure that are not expressible in first-order logics. 

In this paper, we introduce recursion in metric first-order temporal logic (MFOTL) [4] 
in the form of a recursive let construct. We develop and implement an evaluation al- 
gorithm for MFOTL with recursion in VeriMon [3,21], an MFOTL monitor whose 
correctness has been formally verified in the Isabelle proof assistant. To this end, we 
extend the formal correctness proof to cover the recursive let construct. 


© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 236-253, 2022.
https://doi.org/10.1007/978-3-030-99527-0_13
Unlike PFLTL, MFOTL supports bounded future temporal operators and aggregations
(Section 2). The interaction of recursion with bounded future operators is subtle.
To avoid non-termination, DejaVu requires all recursive occurrences to be guarded by 
a previous operator. We similarly require the recursive occurrences to be guarded in our 
monitor, but we relax the requirement on the guard to other past-time operators which 
ensure that their subformulas are evaluated strictly in the past. Moreover, we allow future 
operators in the recursive let construct, as long as no recursion takes place in the future op- 
erator’s arguments. These restrictions ensure that the fixpoint given by the recursive let op- 
erator is well-defined. At the same time, they are permissive and allow us to formulate in- 
teresting examples, several of which are beyond what PFLTL with recursion can express. 

Consider a specification that aims to secure hosts in a network that communicate with 
each other and with the outside world. A host is tainted by an address range iff there is a 
chain of communication from the address to the host and all hosts on the chain trigger an 
intrusion detection alert within one hour after communicating with the previous host. This 
specification can be expressed directly using our recursive let construct (to model chains 
of communication) and future temporal operators (to specify “within one hour after”).

We start by extending MFOTL with a non-recursive let operator (Section 3). This spe- 
cial case is mainly of pedagogical value: aspects common to both let operators are easier 
to explain on the simpler non-recursive variant. Yet, this construct is useful in practice 
to structure complex formulas and improve monitoring performance by sharing common 
subformulas. Thus we extend VeriMon's algorithms and proofs with the non-recursive let. 

We then introduce the recursive let operator (Section 4.1), exemplify its semantics 
with several specifications (Section 4.2), and develop the monitoring algorithm and sketch 
its correctness (Section 4.3). VeriMon's repository [24] contains complete formal proofs. 

This work is part of the long-term effort to develop a trustworthy monitor that 
surpasses in expressiveness and efficiency other non-verified tools. In this work, our focus 
is on expressiveness (and trustworthiness). Nonetheless, we evaluate our algorithmic 
additions to VeriMon on a micro-benchmark and observe that even without further 
optimizations its performance is incomparable to that of DejaVu (Section 5). Moreover,
we detected a problem in DejaVu's handling of variable names in recursive subformulas. 

In summary, our main contribution is the extension of MFOTL with a recursive let 
operator and the design of an evaluation algorithm for it. Along the way, we introduce a 
non-recursive let operator, which proved essential when writing complex specifications. 
Our contributions are implemented as part of VeriMon and proved correct using Isabelle. 


Related Work. Our work adds rule-based specification features [13] to a first-order spec- 
ification language [16]. Above we describe our contribution's relationship to DejaVu and 
VeriMon, two monitors for first-order temporal specifications. VeriMon's algorithm [21], 
which we extend, is based on the algorithm used in the MonPoly monitor [5], although Ve- 
riMon has optimizations that are not present in MonPoly and vice versa [3]. VeriMon sup- 
ports a more expressive specification language than MonPoly, and our introduction of the 
recursive let has increased the gap between the two. VeriMon's and MonPoly's algorithms 
work with finite relations. These tools are thus restricted to MFOTL’s monitorable frag- 
ment [4], which ensures that all subformulas evaluate to finite results. In contrast, DejaVu 
finitely represents infinite relations using BDDs and thus supports the full PFLTL (but 
only closed formulas). Both DejaVu and our work restrict the recursive let syntactically. 
datatype data = Int int | Flt double | Str string
type_synonym ts = nat
type_synonym db = string ⇒ data list set
typedef trace = {s :: (db × ts) stream. trace s}
datatype trm = V nat | C data | trm + trm | ...
typedef ℐ = {(a :: nat, b :: enat). a ≤ b}

datatype frm = string(trm list) | trm ∘ trm | ¬ frm | ∃ frm | frm ∨ frm | frm ∧ frm
  | ●_ℐ frm | ○_ℐ frm | frm S_ℐ frm | frm U_ℐ frm | nat ← agg_op(trm; nat) frm

fun etrm :: data list ⇒ trm ⇒ data where
  etrm v (V x) = v ! x | etrm v (C x) = x | etrm v (t1 + t2) = etrm v t1 + etrm v t2 | ...

fun sat :: trace ⇒ data list ⇒ nat ⇒ frm ⇒ bool where
  sat σ v i (p(as)) = (map (etrm v) as ∈ Γ σ i p)
| sat σ v i (t1 ∘ t2) = (etrm v t1 ∘ etrm v t2)
| sat σ v i (¬φ) = (¬ sat σ v i φ)
| sat σ v i (∃φ) = (∃z. sat σ (z # v) i φ)
| sat σ v i (α ∨ β) = (sat σ v i α ∨ sat σ v i β)
| sat σ v i (α ∧ β) = (sat σ v i α ∧ sat σ v i β)
| sat σ v i (●_I φ) = (case i of 0 ⇒ False | j+1 ⇒ τ σ i − τ σ j ∈_ℐ I ∧ sat σ v j φ)
| sat σ v i (○_I φ) = (τ σ (i+1) − τ σ i ∈_ℐ I ∧ sat σ v (i+1) φ)
| sat σ v i (α S_I β) = (∃j ≤ i. τ σ i − τ σ j ∈_ℐ I ∧ sat σ v j β ∧ (∀k ∈ {j <.. i}. sat σ v k α))
| sat σ v i (α U_I β) = (∃j ≥ i. τ σ j − τ σ i ∈_ℐ I ∧ sat σ v j β ∧ (∀k ∈ {i ..< j}. sat σ v k α))
| sat σ v i (y ← Ω(t; b) φ) = (let M = {(x, card^∞ Z) | x Z.
      Z = {z. length z = b ∧ sat σ (z @ v) i φ ∧ etrm (z @ v) t = x} ∧ Z ≠ {}}
    in (M = {} ⟶ fv φ ⊆ {0 ..< b}) ∧ v ! y = eval_agg_op Ω M)


Fig. 1. Formal syntax and semantics of MFOTL with aggregations, where ∘ ∈ {=, <, ≤}


Other rule-based [2,13] and SRV-based monitors [6,8,11,20] can express the temporal 
operators present in LTL, but struggle with extensions that introduce parameters. Even 
for the operators they can express, specialized algorithms that are carefully tuned for the 
operators tend to exhibit a better performance. Instead of encoding temporal operators, 
we take the opposite approach and enrich a monitor that uses specialized algorithms for 
temporal operators with general-purpose recursion. 

Datalog [1] adds recursion to first-order logic, similarly to our addition of recursion to 
temporal logic. However, Datalog has no built-in notion of time and hence other measures 
must be taken to ensure that the fixpoints are well-defined, e.g., by restricting negation. 
Restricting the recursive occurrences to be strictly in the past is a natural and expressive 
alternative for monitoring, as we do not restrict negation beyond what the monitorable
fragment requires. Works on Datalog extensions with metric temporal operators [7,19,22] 
mostly study the decidability and complexity of computational problems related to these 
extensions, whereas we design, implement, and formally verify an executable algorithm. 


2 Metric First-Order Temporal Logic 


MFOTL extends linear temporal logic with first-order quantification, past-time operators, 
and interval bounds on the temporal operators [4]. The VeriMon monitor [3] supports 
a fragment of this logic. It also adds new features, specifically regular matching oper- 
ators as in linear dynamic logic [9], which results in metric first-order dynamic logic 
(MFODL), as well as aggregations. Our extension of VeriMon with recursive rules retains 
the additional features of MFODL. However, the additional features are orthogonal to our 
extension and hence we base our presentation in this paper on MFOTL with aggregations. 

We summarize MFOTL’s syntax and semantics, as well as the monitorable fragment. 
The presentation generally follows the Isabelle formalization; however, we sometimes 
deviate from Isabelle's concrete syntax for simplicity. We begin by defining some
auxiliary types (top of Fig. 1). The logic's universe (type data) is fixed and infinite: it is
a disjoint sum of integers, 64-bit IEEE floats, and strings of 8-bit characters. Databases
(type db) encode first-order structures as functions from predicate names to relations
over data. Relations are represented as sets of lists. A trace is a stream (an infinite
sequence) of time-stamped databases. Time-stamps (type ts) are modeled as natural
numbers (type nat). We write Γ σ i for the ith database in σ, and τ σ i for its time-stamp.
The predicate trace enforces monotone and eventually increasing time-stamps, i.e.,
∀i ≤ j. τ σ i ≤ τ σ j and ∀x. ∃i. x < τ σ i. Non-empty intervals (type ℐ) are represented
by their end-points. We write [a,b] for the unique interval satisfying n ∈_ℐ [a,b] iff
a ≤ n ≤ b, where n ∈_ℐ I denotes that I contains the natural number n. The interval is
unbounded from above if b = ∞, which the type enat adds to the natural numbers.

Terms (type trm) are constructed recursively from variables (represented by De Bruijn 
indices), constants, and arithmetic operators. We use named variables in examples and 
omit the V and C constructors. There are two kinds of atomic formulas (type frm): 
flexible predicates of the form p(as), where as is a list of terms, and rigid predicates 
t1 ∘ t2 for ∘ ∈ {=, <, ≤}, which have a fixed interpretation. Formally, the existential
quantifier ∃ does not carry a variable name because of the De Bruijn encoding. We use
fv φ to denote the set of De Bruijn indices of φ's free variables.

The semantics is given by the functions etrm and sat (Fig. 1). Both depend on a valu- 
ation, which is a data list assigning a value to each variable. The satisfaction function sat 
for formulas additionally depends on a trace σ and a time-point i, which is an index into
the trace. Indexing into lists is denoted by v ! x, the operation z # v prepends the value z to
the list v, and @ concatenates two lists. The notation {x ..< y} and {x <.. y} is shorthand
for the sets {x, x+1, ..., y−1} and {x+1, x+2, ..., y} of natural numbers, respectively.

An aggregation formula y ← Ω(t; b) φ binds b variables in the subformula φ; the
remaining free variables of φ are used for grouping. Each group is assigned an aggregate
value y, which is computed by first evaluating the term t on each valuation that matches
the group and that satisfies φ, then aggregating the results using the operator Ω (e.g.,
MIN for minimum). To this end, eval_agg_op Ω M (not shown) applies Ω to a set M of
value-multiplicity pairs [3]; card^∞ Z is the cardinality of Z, or ∞ if Z is infinite. The
conjunct M = {} ⟶ fv φ ⊆ {0 ..< b} ensures that the formula is satisfied by the aggregate
value of an empty M only if there are no grouping variables. Otherwise, infinitely many
groups would be labeled with that value, rendering such aggregations non-monitorable.
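
The grouping step can be illustrated with a minimal Python sketch. It assumes that the finite set of satisfying valuations of the subformula is already given, and it only covers non-empty groups (the special case of an empty M is not modeled). The names aggregate, agg_min, and agg_sum are hypothetical, and weighting sums by multiplicity is an assumption about how an operator such as SUM would treat the value-multiplicity pairs.

```python
from collections import defaultdict

def aggregate(sat_vals, b, term, op):
    """Group satisfying valuations and aggregate each group.

    sat_vals: finite set of tuples z + v, where z (length b) holds the
              bound variables and v holds the grouping variables.
    term:     function from a full valuation z + v to a value.
    op:       function from a set M of (value, multiplicity) pairs to a value.
    Returns a dict mapping each group v to its aggregate value.
    """
    groups = defaultdict(lambda: defaultdict(set))
    for val in sat_vals:
        z, v = val[:b], val[b:]
        # Collect, per group and per term value x, the set Z of witnesses z.
        groups[v][term(val)].add(z)
    result = {}
    for v, by_value in groups.items():
        # M pairs each value with the cardinality of its witness set.
        M = {(x, len(zs)) for x, zs in by_value.items()}
        result[v] = op(M)
    return result

def agg_min(M):
    return min(x for x, _ in M)

def agg_sum(M):
    # Assumption: each value counts as often as its multiplicity.
    return sum(x * n for x, n in M)

# Two bound variables (b = 2), one grouping variable; aggregate the first one.
example = {(3, "u", "g1"), (3, "w", "g1"), (4, "u", "g1"), (7, "u", "g2")}
print(aggregate(example, 2, lambda val: val[0], agg_sum))
```

In the example, the value 3 has multiplicity 2 in group g1 because two different witnesses z produce it.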

The decidable predicate mon :: frm ⇒ bool specifies the monitorable fragment. We
omit its formal definition and refer to the earlier descriptions of VeriMon [3,21] for details.
Intuitively, mon places restrictions on the formula's structure to ensure that all subformulas
have finitely many satisfying valuations. Also, the interval I of every U_I operator
must be bounded. A monitor for a monitorable formula can thus compute a finite set of
satisfying valuations for every time-point after observing a sufficiently long trace prefix. 


3 Non-Recursive Let Operator 


We first introduce a non-recursive let operator Let string := frm in frm to the frm datatype.
The formula Let p := α in β associates the formula α with the predicate named p, which
may be used in the formula β. We call such a predicate let-bound. The operator is
non-recursive: p has the same meaning within α as in the surrounding context (unless it
is bound by a nested let in α). Although the non-recursive let operator does not enhance
MFOTL’s expressiveness, it improves readability (by using descriptive let-bound predicate
names), as well as modularity and evaluation efficiency (by sharing subformulas).

Intuitively, the meaning of Let p := α in β is the same as that of β after replacing
all its predicates of the form p(as) with the formula α, whose free variables have been
replaced with the terms as in a capture-avoiding way. The formal syntax does not specify
explicitly how α’s free variables map to p's arguments. The mapping is induced
by the De Bruijn indices: the variable with index 0 becomes the first argument, and
so forth. We list the arguments explicitly in examples that use named variables. For
instance, the formula Let p(x) := p(x) ∧ ∃y. q(x, y) in ●_[0,5] p(y) should be equivalent to
●_[0,5] (p(y) ∧ ∃z. q(y, z)). We achieve this by defining Let's semantics as follows.


sat σ v i (Let p := α in β) = sat (σ[p ⇒ λj. satrel σ j α]) v i β


We write satrel σ j α as an abbreviation for {v. sat σ v j α ∧ length v = nfv α}, i.e.,
the relation containing the valuations that satisfy α. The function nfv α returns the
minimum length of v needed to cover all of α’s free variables, i.e., 0 if α is closed and
Max (fv α) + 1 otherwise. The trace σ[p ⇒ R] is the same as the trace σ except that for
every time-point i, the database at i maps the predicate name p to R i, where R has type
nat ⇒ data list set and is called a temporal relation. Note that the subformula α is not
necessarily evaluated at time-point i. Instead, the choice of the time-point is deferred
until the predicate p is used within β, which we achieve by updating the entire trace.
This supports the intuition behind unfolding the let operator Let p := α in β described
above, especially as subformulas p(as) may occur under temporal operators in β.
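
The deferred choice of the time-point can be made concrete with a small Python sketch. It is a simplification under stated assumptions: formulas are modeled directly as functions from a trace and a time-point to a relation (a set of tuples), a trace is a dict from predicate names to temporal relations, previous is non-metric, and conjunction is plain intersection (so both operands must have the same free variables). All function names are hypothetical.

```python
def pred(name):
    # Flexible predicate: look up the relation stored in the trace at i.
    return lambda trace, i: trace[name](i)

def prev(phi):
    # Non-metric previous operator: empty at the first time-point.
    return lambda trace, i: phi(trace, i - 1) if i > 0 else set()

def conj(phi, psi):
    # Assumption: phi and psi range over the same variables, so the
    # natural join degenerates to set intersection.
    return lambda trace, i: phi(trace, i) & psi(trace, i)

def let(p, alpha, beta):
    def sem(trace, i):
        updated = dict(trace)
        # p is bound to a temporal relation; alpha is evaluated on the
        # ORIGINAL trace, at whatever time-point j beta later asks for.
        updated[p] = lambda j: alpha(trace, j)
        return beta(updated, i)
    return sem

# Let p := p AND q in PREV p, over a trace with unary relations p and q.
trace = {
    "p": lambda i: {(1,), (2,)} if i == 0 else {(3,)},
    "q": lambda i: {(2,), (3,)},
}
phi = let("p", conj(pred("p"), pred("q")), prev(pred("p")))
print([phi(trace, i) for i in range(3)])
```

Because the operator is non-recursive, the occurrence of p inside the definition refers to the p of the surrounding trace, while the occurrence under the previous operator refers to the let-bound p evaluated one time-point earlier.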


Implementation. To evaluate an MFOTL formula on a trace, VeriMon computes a 
finite set of satisfying valuations (represented by the type table) recursively for each 
subformula. It applies standard table operations such as the natural join (⋈) and union.
Tables are sets of tuples, which are lists of optional data values (with missing values
denoted by ⊥) and thus refine valuations. This representation allows us to use lists of
the same length for subformulas with different free variables. As with valuations, the 
variables' De Bruijn indices are used to look up their value in a tuple. 

VeriMon processes an unbounded trace incrementally. Its interface consists of two 
functions init :: frm ⇒ state and step :: dbs × ts list ⇒ state ⇒ (nat × table) list × state.
The function init initializes the monitor's state (type state), and step updates it with
a batch of new time-stamped databases to produce a list of new satisfactions. Instead
of db list, step uses the type dbs = (string ⇀ table list) (a partial mapping from string
to table list) to efficiently retrieve all relations (encoded as tables) associated with a
predicate name at once. Besides some auxiliary data, state stores an inductive state of
type sfrm that mirrors the inductive representation of formulas, augmented with data
structures for evaluating temporal operators and buffering intermediate results. Internally,
step (dbs, tss) st calls eval j n tss dbs s_φ, where j is the combined length of the
trace prefix including the new batch, n = nfv φ for the monitored formula φ, and s_φ is
the inductive state, all stored in st. The function eval returns a list of tables with new
satisfactions, as well as the updated inductive state. Satisfactions are reported for every 
time-point in order. They may be delayed if the formula contains future operators. 
To evaluate Let p := α in β, we use the tables with α’s satisfactions to evaluate
p within β, which requires that the tuples in these tables do not have missing values.
Therefore, we require that let operators satisfy mon (Let p := α in β) = ({0 ..< nfv α} ⊆
fv α ∧ mon α ∧ mon β). Specifically, the (indices of) α’s free variables must not have
gaps. We add the constructor SLet p m s_α s_β to the inductive state, which stores p, the
number m = nfv α of free variables in α, and the states for subformulas α and β. It is
initialized by initializing s_α and s_β recursively. The function eval evaluates it as follows.


eval j n tss dbs (SLet p m s_α s_β) =
  (let (xs, s'_α) = eval j m tss dbs s_α; (ys, s'_β) = eval j n tss (dbs[p ↦ xs]) s_β
   in (ys, SLet p m s'_α s'_β))


We write dbs[p ↦ xs] for the partial mapping dbs updated at p with xs. The recursive call
of eval on s_α may return multiple tables in the list xs. Note that step generalizes the original
VeriMon interface [3] as it consumes multiple time-stamped databases at once. The
generalized interface of eval allows us to pass all tables at once to the recursive call for s_β.


Correctness. We relate the outputs of step and sat to prove our monitor correct. As men- 
tioned earlier, the monitor may delay its output. We precisely characterize its progress 
for a given formula and trace prefix. Intuitively, the progress is the number of time-points 
that the monitor is able to evaluate given a trace prefix. Progress is a useful tool in the cor- 
rectness proof as it helps us describe the output at every time-point. Moreover, we show 
below that progress can be made arbitrarily large, which is important for completeness. 

Formally, prog σ P φ j is φ’s progress i_φ after reading the first j databases of trace σ.
We added the partial mapping P that assigns to every let-bound predicate its own progress,
i.e., the progress of the formula defining the predicate. For example, the progress of
a predicate p that is not let-bound is j. Otherwise, it is equal to the progress of the
formula it is bound to (stored in P p). The progress of α U_[a,b] β is the smallest i such that
τ σ i ≥ τ σ (Min {i_α, i_β, j − 1}) − b. The progress of both α ∧ β and α ∨ β is Min {i_α, i_β}.
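
The cases of the progress function mentioned above can be sketched in Python as follows. This is an illustration only, not VeriMon's definition: formulas are encoded as tuples in a hypothetical mini-syntax, only the listed operators plus Let are covered, the time-stamps of the prefix are passed as a list ts, and the Let case (binding p to the progress of its defining formula) is taken from the way P is updated in the invariant for Let.

```python
def prog(ts, P, phi, j):
    """Progress of phi after the first j databases.

    ts:  time-stamps of the observed trace prefix (at least j entries).
    P:   dict mapping let-bound predicate names to their progress.
    phi: ("pred", name) | ("and", a, b) | ("or", a, b)
         | ("until", b, alpha, beta)   # upper interval bound b
         | ("let", p, alpha, beta)
    """
    kind = phi[0]
    if kind == "pred":
        # Predicates from the trace have progress j; let-bound ones use P.
        return P.get(phi[1], j)
    if kind in ("and", "or"):
        return min(prog(ts, P, phi[1], j), prog(ts, P, phi[2], j))
    if kind == "until":
        _, b, alpha, beta = phi
        if j == 0:
            return 0  # nothing observed yet
        k = min(prog(ts, P, alpha, j), prog(ts, P, beta, j), j - 1)
        # Smallest i with ts[i] >= ts[k] - b; written with + to avoid
        # negative numbers (the formalization works on naturals).
        return next(i for i in range(k + 1) if ts[i] + b >= ts[k])
    if kind == "let":
        _, p, alpha, beta = phi
        return prog(ts, {**P, p: prog(ts, P, alpha, j)}, beta, j)
    raise ValueError(f"unsupported operator: {kind}")

ts = [0, 2, 5, 9, 10]
until = ("until", 3, ("pred", "q"), ("pred", "r"))
print(prog(ts, {}, until, 5))
```

For the prefix with time-stamps 0, 2, 5, 9, 10 and upper bound 3, the sketch reports progress 3: time-points 0 to 2 can be decided, whereas time-point 3 (time-stamp 9) still waits for a time-stamp beyond 12.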

The invariant invar σ j P n s_φ φ relates an inductive state s_φ to the formula φ. The
inductive state must reflect the monitor's state after processing the first j databases in
the trace σ, assuming that P specifies the let-bound predicates’ progress. The parameter
n is the length of the tuples stored within s_φ. The invariant is defined inductively over
s_φ; we reuse VeriMon's definition for the MFOTL operators and add a case for Let:


invar σ j P m s_α α     invar (σ[p ⇒ λi. satrel σ i α]) j (P[p ↦ prog σ P α j]) n s_β β
m = nfv α     {0 ..< m} ⊆ fv α
----------------------------------------------------------------------
invar σ j P n (SLet p m s_α s_β) (Let p := α in β)


The first two premises restrict the subformula states s_α and s_β, where s_β reflects the
evaluation of β on the modified trace, and p's progress is that of α. The premise m = nfv α
enforces that m is equal to p's arity, and {0 ..< m} ⊆ fv α is the constraint from mon.
Our extensions preserve the monitor's correctness: we formally proved the theorem
below, which characterizes the monitor's eval function. The theorem is stated here for
the empty progress mapping ∅, which must be generalized in the proof (as P changes in
the above rule). Let δ be a natural number and φ be a monitorable formula with n = nfv φ.
The function the maps the optional value ⟨x⟩ to x and ⊥ to some unspecified value.
Theorem 1. (a) invar σ 0 ∅ n s⁰_φ φ for the initial state s⁰_φ. (b) Suppose that s_φ satisfies
invar σ j ∅ n s_φ φ and that dbs contains all relations from σ for the indices in the list
js = [j ..< j+δ]. Then (xs, s'_φ) = eval (j+δ) n (map (τ σ) js) dbs s_φ satisfies
invar σ (j+δ) ∅ n s'_φ φ, and the i-th table in the list xs, for prog σ ∅ φ j ≤ i < prog σ ∅ φ (j+δ),
contains (only) all tuples v of length n satisfying sat σ (map the v) i φ.


Soundness follows immediately from Thm. 1, whereas completeness additionally re- 
quires the aforementioned property that any progress can be reached by making the trace 
prefix long enough, which we also proved for our modified progress function: 


Theorem 2. If mon φ, then for all i there exists a j such that prog σ ∅ φ j ≥ i.


4 Past-Recursive Let Operator 


It is well-known that first-order logic (FOL) cannot express certain queries, notably the
transitive closure of a binary relation. This remains true when restricted to finite struc- 
tures [18]. Although MFOTL is rather different from ordinary FOL, we conjecture that 
it cannot express transitive closure either. This hampers its ability to model hierarchies 
of unbounded depth. Moreover, recursive patterns are sometimes the most natural way 
to express certain specifications. We describe an extension of MFOTL that can encode a 
"temporally directed" form of transitive closure and other recursive patterns. 
Specifically, we introduce another let operator in which the predicate may refer to 
itself recursively. The intended semantics is that of a fixpoint, i.e., the predicate p defined 
by a formula α should be interpreted by a temporal relation that is equal to the evaluation
of α under that interpretation of p. The fixpoint might not always exist or it might not
be unique. Therefore, different fixpoint operators have been studied in the context of 
nontemporal logics and query languages [1]. For instance, it is common to require that all 
recursive occurrences of p in its defining formula are positive, i.e., under an even number 
of negations. This ensures monotonicity and hence the existence of a least fixpoint. 
MFOTL’s future operators are interpreted over infinite traces. This poses a new chal- 
lenge for monitoring recursively defined predicates, even if we restrict our attention to 
positive formulas. Consider the recursive definition of p by q ∨ ○_[0,∞] p, where q is a
predicate from the trace. Although q ∨ ○_[0,∞] p is monitorable (at most one additional
time-point must be known to evaluate it), the recursive definition of p is equivalent to ◇_[0,∞] q
under the least fixpoint semantics. However, ◇_[0,∞] q is not monitorable, as one might need
the entire, infinite trace to evaluate it. Therefore, we focus on a fragment where every
recursive occurrence of p must be strictly in the past. This guarantees a unique fixpoint even
if the defining formula is not monotone, so the predicate may occur negatively as well.
The syntax of our past-recursive let operator is similar to the one of Let: we add the 
constructor LetPast string := frm in frm to the frm datatype. However, the semantics is 
different (Section 4.1). The restriction to strictly past recursion is enforced by a syntactic 
monitorability condition that is checked by mon. Consider the formula LetPast p :=
α in β. Intuitively, every recursive occurrence of p in α must be guarded by at least
one strictly past operator, and there must be no future operator on the path from the
occurrence to α’s root. We do allow future operators in the other parts of α, though.
We give examples of LetPast in Section 4.2. The evaluation of LetPast requires an 
extension of VeriMon's algorithm (Section 4.3), which we also formally prove correct. 
datatype recSafety = U | P | NF | A

fun (*) :: recSafety ⇒ recSafety ⇒ recSafety where
  U * _ = U
| _ * U = U
| A * _ = A
| _ * A = A
| P * _ = P
| _ * P = P
| NF * NF = NF

fun slp :: string ⇒ frm ⇒ recSafety where
  slp p (q(as)) = (if p = q then NF else U)
| slp p (Let q := α in β) = (slp q β * slp p α) ⊔ (if p = q then U else slp p β)
| slp p (LetPast q := α in β) = (if p = q then U else (slp q β * slp p α) ⊔ slp p β)
| slp p (t1 ∘ t2) = U
| slp p (y ← Ω(t; b) φ) = slp p φ
| slp p (¬φ) = slp p φ
| slp p (∃φ) = slp p φ
| slp p (α ∨ β) = slp p α ⊔ slp p β
| slp p (α ∧ β) = slp p α ⊔ slp p β
| slp p (●_I φ) = P * slp p φ
| slp p (○_I φ) = A * slp p φ
| slp p (α S_I β) = slp p α ⊔ ((if 0 ∈_ℐ I then NF else P) * slp p β)
| slp p (α U_I β) = A * (slp p α ⊔ slp p β)


Fig. 2. Auxiliary definitions for the syntactic restriction on LetPast 


4.1 Semantics 
The semantics of the past-recursive let operator is defined by the equation 
sat σ v i (LetPast p := α in β) = sat (σ[p ⇒ recp (λR j. satrel (σ[p ⇒ R]) j α)]) v i β


We evaluate β at the same time-point i as the recursive let operator using an appropriately
updated trace. The temporal relation assigned to p is computed by the combinator recp: 


fun recp :: ((nat ⇒ data list set) ⇒ nat ⇒ data list set) ⇒ nat ⇒ data list set where
  recp f i = f (λj. if j < i then recp f j else {}) i


The argument f is a function that transforms temporal relations, and recp f returns again
a temporal relation. Intuitively, recp f evaluates to the fixpoint f (recp f), except that
f R i can only access time-points of R before i. For all other time-points j ≥ i, the relation
R j is empty. The combinator recp is well-defined because i is a natural number; the
recursive call recp f j affects the result only if j < i and hence we can prove termination
using i as a variant. For the semantics of LetPast, we choose f R i = satrel (σ[p ⇒ R]) i α,
i.e., the satisfactions of α with p mapped to f's argument R, to which recp supplies the
result of the recursive evaluation (up to but excluding i).
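
A direct Python transcription of recp may help to see how the restriction to earlier time-points works. This is a sketch, not the Isabelle definition: temporal relations are modeled as Python functions from time-points to frozensets, and the two example transformers once_f and next_f (and the event relation q) are hypothetical. Like the definition it mirrors, the sketch recomputes earlier time-points on every call and is not meant to be efficient.

```python
def recp(f, i):
    """Evaluate the past-recursive fixpoint of f at time-point i.

    f takes a temporal relation R and a time-point i and returns a relation.
    The relation handed to f is empty at all time-points j >= i.
    """
    def restricted(j):
        return recp(f, j) if j < i else frozenset()
    return f(restricted, i)

# A relation q that holds the tuple ("e<i>",) at even time-points only.
def q(i):
    return frozenset({("e%d" % i,)}) if i % 2 == 0 else frozenset()

# Guarded recursion: p := q OR PREV p (everything q has produced so far).
def once_f(R, i):
    return q(i) | (R(i - 1) if i > 0 else frozenset())

# Unguarded recursion under a future operator: p := q OR NEXT p.
def next_f(R, i):
    return q(i) | R(i + 1)

print(sorted(recp(once_f, 4)))
print(sorted(recp(next_f, 2)))
```

The second transformer looks at time-point i + 1, where the restricted relation is empty, so its result collapses to q at the current time-point.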

Our definition of sat is total: it gives meaning to every formula. This includes formulas
LetPast p := α in β where p occurs in α without a past guard or under a future
operator. However, the semantics behaves unexpectedly in such cases. For example,
LetPast p := (q ∨ ○_[0,∞] p) in p is equivalent to q. Our monitor therefore requires properly
guarded formulas. Not only does this avoid confusion about the semantics, it also simpli- 
fies the implementation because the monitor need not eliminate unguarded occurrences. 

Next, we describe the formalization of the syntactic restriction. The idea is to deter- 
mine for every predicate whether it is used strictly in the past by analyzing the formula 
recursively. The datatype recSafety (Fig. 2) represents the possible outcomes. U(nused) 
means that a predicate does not occur in the formula. P(ast) means that it is evaluated 
at strictly earlier time-points, whereas NF (Non-Future) additionally allows the current 
time-point. A(ny) covers all remaining cases. The linear order < on recSafety is induced
by U < P < NF < A. Its reflexive closure ≤ corresponds to implication. For example, if
the predicate p is unused (U), it is clearly evaluated at earlier time-points only (P). The
least upper bound x ⊔ y with respect to ≤ corresponds to logical disjunction.
The function slp p φ (Fig. 2) analyzes the past-guardedness of a predicate p in a formula
φ. It uses a composition operator y * x on recSafety. The patterns in the definition of
* should be matched sequentially from top to bottom; e.g., A * U is equal to U. Intuitively,
y * x describes the guardedness of a predicate that is x-used in some subformula, which
is then y-used. For example, slp p (●_I φ) = P * slp p φ because φ and all occurrences of p
therein are evaluated at time-points that are strictly in the past relative to ●_I φ. Note that
we make a case distinction for α S_I β: if the interval I excludes zero, β is always evaluated
strictly in the past. Future operators always result in A if p is used in an operand.
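
The analysis of Fig. 2 can be sketched in Python as follows. The sketch makes several simplifying assumptions: formulas are encoded as tuples in a hypothetical mini-syntax, intervals are abstracted away except for a flag on since that records whether 0 lies in the interval, and recSafety is represented by the integers 0 to 3 so that the least upper bound is simply max.

```python
U, P, NF, A = range(4)  # linear order U < P < NF < A

def comp(y, x):
    """Composition y * x; patterns are tried from top to bottom."""
    if y == U or x == U:
        return U
    if y == A or x == A:
        return A
    if y == P or x == P:
        return P
    return NF

def lub(x, y):
    return max(x, y)

def slp(p, phi):
    kind = phi[0]
    if kind == "pred":                      # ("pred", q)
        return NF if phi[1] == p else U
    if kind == "rigid":                     # comparison of two terms
        return U
    if kind in ("neg", "ex", "agg"):        # one subformula at phi[1]
        return slp(p, phi[1])
    if kind in ("or", "and"):
        return lub(slp(p, phi[1]), slp(p, phi[2]))
    if kind == "prev":
        return comp(P, slp(p, phi[1]))
    if kind == "next":
        return comp(A, slp(p, phi[1]))
    if kind == "since":                     # ("since", zero_in_I, alpha, beta)
        _, zero_in_I, alpha, beta = phi
        guard = NF if zero_in_I else P
        return lub(slp(p, alpha), comp(guard, slp(p, beta)))
    if kind == "until":                     # ("until", alpha, beta)
        return comp(A, lub(slp(p, phi[1]), slp(p, phi[2])))
    if kind == "let":                       # ("let", q, alpha, beta)
        _, q, alpha, beta = phi
        rest = U if p == q else slp(p, beta)
        return lub(comp(slp(q, beta), slp(p, alpha)), rest)
    if kind == "letpast":                   # ("letpast", q, alpha, beta)
        _, q, alpha, beta = phi
        if p == q:
            return U
        return lub(comp(slp(q, beta), slp(p, alpha)), slp(p, beta))
    raise ValueError(f"unsupported operator: {kind}")

names = {U: "U", P: "P", NF: "NF", A: "A"}
print(names[slp("p", ("prev", ("pred", "p")))])
print(names[slp("p", ("since", True, ("pred", "q"), ("pred", "p")))])
```

The two printed results show that an occurrence under previous is strictly past, whereas an occurrence in the second operand of since is only non-future when the interval contains zero.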

Finally, we define the mon predicate for the recursive let operator: 


mon (LetPast p := α in β) = (slp p α ≤ P ∧ {0 ..< nfv α} ⊆ fv α ∧ mon α ∧ mon β)


The only difference to Let is the restriction of p's occurrences in α via slp, which is
generally an over-approximation. For example, slp p (●_I ●_I ○_I p) = A even though p is
evaluated at strictly earlier time-points. Therefore, some instances of LetPast that our
algorithm could evaluate correctly are not considered to satisfy mon. We plan to replace 
recSafety with a more precise lattice in future work. 


4.2 Examples 


Temporal Operators. We first show that the non-metric S operator can be reduced to
LetPast and ●. (We omit the interval subscripts if the interval is [0,∞].) Using the special
ts(t) predicate, which is true iff t is the current time-stamp, we can also express the metric
version. This example serves to gently illustrate the semantics of LetPast. In general, for- 
mulas are more readable if they are directly expressed in terms of S, and monitoring can 
be more efficient. Below we give further examples in which LetPast adds expressiveness. 

Let α and β be two monitorable MFOTL formulas with free variables fv α and fv β,
respectively. The formula α S β is monitorable only if fv α ⊆ fv β, so let us assume that,
too. The following unfolding of S's semantics is well-known:


sat σ v i (α S β) ⟺ sat σ v i β ∨ (sat σ v i α ∧ i > 0 ∧ sat σ v (i − 1) (α S β))     (1)


As the unfolding recursively evaluates the formula at the previous time-point, we can
directly translate it into a recursive let operator: φ_S ≡ LetPast s(x) := ψ in s(x), where
ψ ≡ β ∨ (α ∧ ●s(x)). The predicate name s must be fresh, i.e., it must not occur in α nor
β. The variable list x enumerates fv β. The formula φ_S is monitorable because ●s(x) is
clearly past-guarded, and hence slp s ψ = P. (We also need fv β = {0 ..< nfv β}, which
can be achieved by renaming variables in α and β.) Let us analyze the semantics of φ_S:
sat o v i ps <=> sat (o[s 2 recp (AR j. satrel (o[s > R]) j v)]) v i (s(x)) 
$$ 


N. 
€— verecp fy i =fy 
<=> sat (o[s 2 Aj. if j < ithen recp fy j else (M) viv 


ey sato viBV(satevia^i»0 


^v € (if i— 1 < i then recp fy (i— 1) else {})) 
<=> sato vifV (sata. via ^i» 0Asato v (i— 1) gs) 


These equations hold for all valuations v of length nfv β and if the variables x are ordered
by their De Bruijn indices. Step (*) exploits the freshness of s with respect to α and β,
which allows us to replace σ[s ⇒ ...] by σ. The equations result in the same unfolding
as (1). Hence, we can prove the semantic equivalence of φ_S and α S β by induction on i.
The following SinceLet formula encodes α S_[a,b] β. Other encodings exist, however.


LetPast s(x,t) := (β ∧ ts(t)) ∨ (α ∧ ● s(x,t)) in ∃t, u. s(x,t) ∧ ts(u) ∧ a ≤ u − t ∧ u − t ≤ b


Here, t and u are fresh variables, where t records the time-stamp of the past satisfaction of
β, whereas u is the time-stamp at which we evaluate SinceLet. The subformula a ≤ u − t ∧
u − t ≤ b corresponds to τ σ i − τ σ j ∈ [a,b], which is part of S_[a,b]'s semantics (Fig. 1).
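
The SinceLet encoding can be sketched operationally for the propositional case (no free variables x). The following Python sketch is only an illustration under that simplifying assumption: the names since_direct and since_letpast are hypothetical and not part of VeriMon. The recursive predicate s is represented by the set of time-stamps t for which s(t) currently holds, and the result is compared against the direct semantics of α S_[a,b] β.

```python
from itertools import product


def since_direct(alpha, beta, ts, a, b, i):
    """Direct semantics of alpha S_[a,b] beta at time-point i."""
    return any(
        beta[j]
        and a <= ts[i] - ts[j] <= b
        and all(alpha[k] for k in range(j + 1, i + 1))
        for j in range(i + 1)
    )


def since_letpast(alpha, beta, ts, a, b):
    """Evaluate the SinceLet encoding at every time-point.

    s_prev plays the role of the previous-time-point table for s(t).
    """
    verdicts = []
    s_prev = set()  # no table for s before time-point 0
    for i in range(len(ts)):
        s_now = set()
        if beta[i]:            # beta ∧ ts(t): record the current time-stamp
            s_now.add(ts[i])
        if alpha[i]:           # alpha ∧ ● s(t): keep earlier witnesses
            s_now |= s_prev
        # body: ∃t, u. s(t) ∧ ts(u) ∧ a ≤ u − t ∧ u − t ≤ b
        verdicts.append(any(a <= ts[i] - t <= b for t in s_now))
        s_prev = s_now
    return verdicts


def agree_on_all_small_traces(ts, a, b):
    """Exhaustively compare both definitions on all boolean traces over ts."""
    n = len(ts)
    for alpha in product([False, True], repeat=n):
        for beta in product([False, True], repeat=n):
            expected = [since_direct(alpha, beta, ts, a, b, i) for i in range(n)]
            if since_letpast(alpha, beta, ts, a, b) != expected:
                return False
    return True
```

Note that the state carried between time-points is a single table, which mirrors how the recursion only ever looks one time-point back.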


Temporally-Directed Transitive Closure. We proceed by showing that LetPast can 
compute a temporally-directed transitive closure over events observed at a sequence of 
distinct time-points. Hence, we assume that the trace contains a single event at every 
time-point. The closure is directed in the sense that the transitive chains can only be 
extended by newer events. We consider the following two types of events from [14]: 
r(y,x,d) denotes that process y reports some data d to another process x, and s(x,y) 
denotes that process x spawns process y. The Spawn formula 


LetPast p(u,v) := s(u,v) ∨ (● p(u,v)) ∨ (∃t. (● p(u,t)) ∧ s(t,v)) in r(y,x,d) ∧ ¬p(x,y)


encodes violations of the property that whenever process y sends some data d to a process
x, denoted as r(y, x, d), then there was a chain of process spawns: s(x,x1), s(x1,x2), ...,
s(xk,y), occurring in this order in the trace. In other words, a process may only send data
to its "ancestors". To check this property, a monitor needs to compute the (temporally- 
directed) transitive closure p(u, v) of the relation s. The definition of the closure has two 
recursive predicate instances with different arguments. The Spawn formula is inspired 
by a similar one used to evaluate the DejaVu monitor [14]. Unlike DejaVu, we do not 
require the formula to be closed and thus leave the variables x, y, and d free. 
The Trans formula 


LetPast p(u,v) := s(u,v) ∨ (● p(u,v)) ∨
    (∃t. (● p(u,t)) ∧ s(t,v)) ∨ (∃t. s(u,t) ∧ (● p(t,v))) ∨
    (∃t, t'. (● p(u,t)) ∧ s(t,t') ∧ (● p(t',v))) in r(y,x,d) ∧ ¬p(x,y)


encodes violations of the same property as Spawn even if s(x,x1), s(x1,x2), ..., s(xk,y)
are received by the monitor out-of-order, i.e., they do not occur in this order in the trace.
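
The difference between Spawn and Trans can be made concrete with a small Python sketch. It is an illustration only, assuming one event per time-point and a hypothetical trace encoding as (kind, args) pairs; it is not VeriMon's algorithm. The set p holds the table of the recursive predicate from the previous time-point, and the flag out_of_order switches from the Spawn rules to the additional Trans rules.

```python
def compose(r1, r2):
    """Relational composition: (u, v) with (u, t) in r1 and (t, v) in r2."""
    return {(u, v) for (u, t) in r1 for (t2, v) in r2 if t == t2}


def violations(trace, out_of_order=False):
    """Return (time-point, r-event) pairs at which the formula is satisfied.

    trace is a list of events ("s", (x, y)) or ("r", (y, x, d)).
    """
    p = set()  # table of p at the previous time-point (empty before 0)
    result = []
    for i, (kind, args) in enumerate(trace):
        s = {args} if kind == "s" else set()
        # Spawn: s(u,v) ∨ ● p(u,v) ∨ ∃t. ● p(u,t) ∧ s(t,v)
        new_p = s | p | compose(p, s)
        if out_of_order:
            # Trans adds: ∃t. s(u,t) ∧ ● p(t,v) and the two-sided extension
            new_p |= compose(s, p) | compose(compose(p, s), p)
        p = new_p
        if kind == "r":
            y, x, d = args
            if (x, y) not in p:  # r(y,x,d) ∧ ¬p(x,y)
                result.append((i, args))
    return result
```

On the trace s(2,3), s(1,2), r(3,1,d) the Spawn rules report a violation because the chain is not extended backwards, whereas the Trans rules derive p(1,3) and stay silent.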

We can interpret the events s(x, y) as edges in a directed graph and the predicate 
p(x,y) in Trans as computing the reachability of vertices in the directed graph. We also 
extend the directed edges s(x, y) with a weight w to s*(x, y, w). Then the Trans* formula


LetPast p(u,v,w) := s*(u,v,w) ∨ (● p(u,v,w)) ∨
    (∃t, w1, w2. (● p(u,t,w1)) ∧ s*(t,v,w2) ∧ w = w1 + w2) ∨
    (∃t, w1, w2. s*(u,t,w1) ∧ (● p(t,v,w2)) ∧ w = w1 + w2) ∨
    (∃t, t', w1, w2, w3. (● p(u,t,w1)) ∧ s*(t,t',w2) ∧ (● p(t',v,w3)) ∧
        w = w1 + w2 + w3) in
Let m(u,v,w) := w ← MIN(w; u, v). p(u,v,w) in m(x,y,w) ∧ ¬(● m(x,y,w))


yields all pairs of vertices x, y and the length w of the shortest path from x to y whenever 
y becomes reachable from x or the length of the shortest path changes. The relation 
s*(x, y, w) can itself be obtained by evaluating a more complex temporal formula, e.g.,
s*(x, y, w) ≡ e(x, y, w) ∧ ¬◇_[0,10] d(x, y) with the following two types of events: e(x, y, w)
denotes an edge from x to y with weight w; d(x, y) denotes deletion of the edge from x
to y. The eventually operator ◇_I φ abbreviates (∃x. x = x) U_I φ. Such a relation s*(x, y, w)
contains all edges that are not revoked within 10 time units after receiving e(x, y, w). We
could use the non-recursive let operator Let s*(x, y, w) := e(x, y, w) ∧ ¬◇_[0,10] d(x, y) to
precompute the relation and use it when evaluating the recursive let operator in Trans*.

As another application of future operators under LetPast, recall our introductory 
example. Suppose that hosts in a network communicate with each other and with the out- 
side world: comm(src, dest) indicates that host src sends a message to host dest; in(r,h) 
and out(h, r) indicate that the host h receives or sends traffic from or to an IP address in 
the range r, respectively. The hosts are equipped with an intrusion detection system (IDS), 
whose alerts are denoted by ids(h). We say that a host h is tainted by an address range r iff
there is a chain of communication from r to h and all hosts on the chain (including h) trigger
an IDS alert within one hour after communicating with the previous host. The formula


LetPast taint(r,h) := ((in(r,h) ∨ ∃h'. (● taint(r,h')) ∧ comm(h',h)) ∧ ◇_[0,1h] ids(h)) ∨
    (● taint(r,h)) in taint(r,h) ∧ out(h,r)


is true whenever a host communicates back to the IP range by which it was tainted. 


Periodic Behavior. Suppose that we monitor a boolean signal b(x), parametrized by an 
integer parameter x, between the user's start(x) and stop(x) commands. An arbitrary 
amount of time may pass between these two commands. Our task is to detect periodic 
activations of b(x), with a fixed period t > 0 and error tolerance 0 ≤ ε < t. We shall
ignore positive noise in b(x), i.e., additional activations besides the periodic ones. 

Let us make the task more precise. An alarm must be raised at time-point i_n iff
there exist time-points i_0 < i_1 < ··· < i_n such that start(x) holds at i_0, stop(x) holds at i_n,
and b(x) holds at all i_k for 1 ≤ k ≤ n − 1. Moreover, the difference of time-stamps for
adjacent time-points i_k and i_{k+1}, where 1 ≤ k ≤ n − 2, must be in the interval [t − ε, t + ε];
the differences for the pairs i_0, i_1 and i_{n−1}, i_n must each be at most t + ε.

Our first attempt PB to formalize the alarm condition without recursion is

stop(x) ∧ (◆_I (start(x) ∨ b(x))) ∧ ((b(x) → (◆_I start(x)) ∨ (◆_J b(x))) S start(x))

where I = [0, t + ε], J = [t − ε, t + ε], and ◆_K φ abbreviates (∃x. x = x) S_K φ. This formula
follows an inductive approach: every b(x) between start(x) and stop(x) must be preceded
by b(x) or start(x), with the appropriate time difference. However, PB does not ignore
noise, as adding b(x) events to the trace may silence an alarm. For example, let t = 10, ε =
0, and σ be a trace starting with ({start(1)}, 0), ({b(1)}, 10), ({stop(1)}, 20). We write
{p(1), p(2)} for the database where the predicate p holds for 1 and 2. On σ, PB is true
at the third time-point. Inserting a database {b(1)} with time-stamp 15 falsifies PB at the
now fourth time-point, although the trace still satisfies the natural language description.

The following PBLet formula expresses the intended condition using LetPast: 

LetPast periodic(x) := start(x) ∨ (b(x) ∧ ((◆_I start(x)) ∨ (◆_J periodic(x)))) in
    stop(x) ∧ ◆_I periodic(x)


This example depends crucially on the flexible past guards we support: here, the recursion
goes through ◆ with an interval constraint. Note that 0 ∉ J because we assumed ε < t.
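
The effect of noise on PB and PBLet can be checked on the example trace with a small Python sketch. It is a hand-written illustration for a single fixed parameter x, not generated monitor code; databases are modelled as sets of the strings "start", "b", and "stop", and the function names are hypothetical.

```python
def once(holds, ts, i, lo, hi):
    """◆_[lo,hi] at time-point i: some j <= i with holds[j] and lo <= ts[i]-ts[j] <= hi."""
    return any(holds[j] and lo <= ts[i] - ts[j] <= hi for j in range(i + 1))


def split(trace):
    """Turn a list of (database, time-stamp) pairs into per-event boolean lists."""
    ts = [stamp for _, stamp in trace]
    start = ["start" in db for db, _ in trace]
    b = ["b" in db for db, _ in trace]
    stop = ["stop" in db for db, _ in trace]
    return ts, start, b, stop


def pb(trace, t, eps):
    """Non-recursive PB, evaluated at the last time-point of the trace."""
    ts, start, b, stop = split(trace)
    i = len(trace) - 1
    start_or_b = [u or v for u, v in zip(start, b)]
    # guard[k]: b → (◆_I start) ∨ (◆_J b), with I = [0, t+eps], J = [t-eps, t+eps]
    guard = [
        (not b[k]) or once(start, ts, k, 0, t + eps) or once(b, ts, k, t - eps, t + eps)
        for k in range(len(trace))
    ]
    since = any(
        start[j] and all(guard[k] for k in range(j + 1, i + 1)) for j in range(i + 1)
    )
    return stop[i] and once(start_or_b, ts, i, 0, t + eps) and since


def pblet(trace, t, eps):
    """PBLet, evaluated at the last time-point of the trace."""
    assert 0 <= eps < t  # guarantees 0 ∉ J, so the recursion is strictly past-guarded
    ts, start, b, stop = split(trace)
    periodic = []
    for i in range(len(trace)):
        # ◆_J periodic only needs earlier time-points because 0 ∉ J
        rec = any(periodic[j] and t - eps <= ts[i] - ts[j] <= t + eps for j in range(i))
        periodic.append(start[i] or (b[i] and (once(start, ts, i, 0, t + eps) or rec)))
    i = len(trace) - 1
    return stop[i] and once(periodic, ts, i, 0, t + eps)


clean = [({"start"}, 0), ({"b"}, 10), ({"stop"}, 20)]
noisy = [({"start"}, 0), ({"b"}, 10), ({"b"}, 15), ({"stop"}, 20)]
```

With t = 10 and ε = 0, both formulas raise the alarm on the clean trace, while only PBLet still raises it after the extra activation at time-stamp 15 is inserted.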
As another example of periodic behavior, we analyze an integer-valued signal(y)
between the (now non-parametric) commands start and stop. We aim to discover whether
signal(y) is piecewise constant, with the constant segments being exactly t time units
long. Moreover, the signal's values for subsequent segments must differ by at most δ. The
next formula uses the general S operator as the recursion guard to capture this property.


LetPast segment(y) := ∃z. signal(y) ∧ (((● signal(z)) S_[t,t] (signal(z) ∧ ● start)) ∨
    ((● signal(z)) S_[t,t] segment(z))) ∧ −δ ≤ y − z ∧ y − z ≤ δ in
    stop ∧ ∃y. ((● signal(y)) S_[t,t] segment(y))


Turing Machines. Every MFOTL formula can be viewed as a function on traces, where 
the function's output is the set of satisfying valuations, either at a fixed or at all time- 
points. VeriMon's monitorable fragment guarantees that one can compute the valuation 
at every time-point. Thus, monitorable formulas correspond to computable functions. If 
we give up on the requirement that the function's output must be available at a fixed time- 
point, the past-recursive let operator is expressive enough to simulate arbitrary Turing ma- 
chines (TM). This is not a contradiction: we simulate a single TM step at every time-point, 
and there is an infinite supply of time-points. Running the monitor on a configuration that 
does not halt will never produce an output, i.e., a nonempty set of satisfying valuations. 

Let M = (Σ, b, Q, q0, qf, δ) be a deterministic TM with tape alphabet Σ, blank symbol
b ∈ Σ, control states Q, initial state q0 ∈ Q, final state qf ∈ Q, and transition function
δ ∈ (Q × Σ → Q × Σ × {−1, 0, 1}). Whenever the machine is in state q1 and reads the
symbol s1, it enters state q2, writes the symbol s2, and moves the head by m tape cells to
the right, where δ(q1, s1) = (q2, s2, m). Without loss of generality, we assume that Σ and
Q are finite subsets of the integers. We simulate M using the formula φ_M shown below.


LetPast cfg(q,i,s) :=
    Let cfg(q,i,s) := ● cfg(q,i,s) in
    Let head(q,s) := cfg(q,0,s) ∨ (¬(∃x,z. cfg(x,0,z)) ∧ (∃y,z. cfg(q,y,z)) ∧ s = b) in
    (input(i,s) ∧ q = q0) ∨
    ⋁_{q1, s1 with δ(q1,s1) = (q2,s2,m)} (head(q1,s1) ∧ q = q2 ∧ ((i = −m ∧ s = s2) ∨
        (∃j. cfg(q1,j,s) ∧ j ≠ 0 ∧ i = j − m)))
in cfg(qf,i,s)


The idea is that cfg represents the current configuration of the TM. Specifically,
cfg(q,i,s) holds if the machine is in control state q and the tape contains the symbol
s in the ith cell to the right of the head (i may be negative). Note that we use
nested, non-recursive let operators to abbreviate repeated subformulas. In the body of
Let cfg(q,i,s) := ● cfg(q,i,s) in ..., the predicate cfg refers to the previous configuration.
The predicate head provides the current state and the symbol under the head. Its definition
extends the tape by a blank symbol if necessary. The simulation is started at time-point 0
by providing the tape's initial content in the predicate input, which must include the cell
input(0, s0) with the symbol s0 under the head's initial position. If and only if M halts
on this input, there exists a time-point i at which φ_M is satisfied by at least one valuation
(i, s). Moreover, the satisfying valuations at i represent the final state of the tape.
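
The head-relative tape representation can be illustrated by executing the recursion directly. The Python sketch below is a simplified model under stated assumptions, not the monitor: cfg is a set of triples (q, i, s), the transition function is a dictionary, each loop iteration stands for one time-point, and the names step and run are hypothetical. The bound max_steps exists only so that the sketch terminates on non-halting machines.

```python
def step(cfg, delta, blank):
    """Compute the configuration at the next time-point from the previous one."""
    # head(q, s): state and symbol under the head; extend the tape by a blank
    # if no cell is stored at relative position 0.
    head = {(q, s) for (q, i, s) in cfg if i == 0}
    if not head:
        head = {(q, blank) for (q, _, _) in cfg}
    new_cfg = set()
    for (q1, s1) in head:
        if (q1, s1) not in delta:
            continue  # no transition, e.g. in the final state
        q2, s2, m = delta[(q1, s1)]
        # the cell that was under the head now sits at relative position -m
        new_cfg.add((q2, -m, s2))
        # all other cells shift by m relative to the new head position
        new_cfg |= {(q2, j - m, s) for (q, j, s) in cfg if q == q1 and j != 0}
    return new_cfg


def run(tape, delta, blank, q0, qf, max_steps):
    """Return the satisfying valuations (i, s) of cfg(qf, i, s), or None."""
    cfg = {(q0, i, s) for (i, s) in tape}  # time-point 0: input(i, s) ∧ q = q0
    for _ in range(max_steps):
        final = {(i, s) for (q, i, s) in cfg if q == qf}
        if final:
            return final
        cfg = step(cfg, delta, blank)
    return None


# Hypothetical example machine: append a 1 to a unary number (blank symbol 0).
delta_inc = {(0, 1): (0, 1, 1), (0, 0): (1, 1, 0)}
```

Running the example machine on the tape 11 yields three cells containing 1, at positions 0, −1, and −2 relative to the final head position.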
4.3 Algorithm


The restriction to past-guarded recursion allows for an efficient evaluation algorithm for
LetPast formulas. It is efficient because no fixpoint iteration is required at individual
time-points. To evaluate LetPast p := α in β, we first try to evaluate α for as many time-points
as possible and then use the results to interpret p in β. This part is the same as for
the non-recursive Let, but the evaluation of α itself differs. The syntactic monitorability
condition guarantees that α at time-point i depends on the predicate p only for time-points
strictly less than i. Specifically, we have defined mon (LetPast p := α in β) such
that the progress of α's evaluation does not depend on p's progress beyond time-point
i − 1. Therefore, we can evaluate α at time-point 0 without providing any table for p,
then use the result to evaluate α at time-point 1, and so forth.
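
The basic feedback loop described above can be sketched in a few lines of Python. This is a deliberately simplified model that ignores future operators, buffering, and formula states: the hypothetical function alpha receives the current database and the table computed for p at the previous time-point (None at time-point 0).

```python
def eval_letpast(dbs, alpha):
    """Evaluate the let-bound predicate time-point by time-point.

    dbs is a list of databases (dicts from predicate names to sets of values);
    alpha(db, prev_p) returns the table of p at the current time-point.
    """
    tables = []
    prev = None  # no table for p is available at time-point 0
    for db in dbs:
        cur = alpha(db, prev)
        tables.append(cur)
        prev = cur  # feed the result into the next time-point
    return tables


def once_q(db, prev_p):
    """alpha for LetPast p(x) := q(x) ∨ ● p(x)."""
    return set(db.get("q", set())) | (prev_p or set())
```

No fixpoint iteration happens within a time-point: each table is computed once from the database and the previous table.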

There are two cases that require care. First, if α contains future operators, multiple
time-points may be evaluated at once. The above process must then be repeated within a
single monitor step. Second, if α contains no future operators, α is evaluated at all time-points
i < j, where j is the current trace prefix length. We could then attempt to evaluate α
once more at time-point j using the table computed at j − 1 for p. However, this would not
yield any further tables because all occurrences of p are below at least one past operator
that tries to access the time-stamp at time-point j, which is not yet known. Therefore,
this last evaluation attempt would needlessly traverse the formula state. We optimize this
case and buffer α's result at time-point j − 1 until the next input database arrives.

It is crucial that the evaluation of a recursive let does not get stuck waiting for tables 
that it needs to produce itself. Therefore, all operators that are strictly past-guarding as 
defined by slp (Fig. 2) must be well-behaved: the evaluation algorithm must compute a 
result at time-point i < j even if the operands’ results are available only for time-points 
i' < i. In particular, S_I without 0 in the interval is considered strictly past-guarding. We
have modified VeriMon's evaluation algorithm for α S_I β to achieve this behavior.

The inductive state SLetPast p m s_α s_β i buf for a recursive let operator extends
SLet with a counter i :: nat, which tracks the progress of p as observed by s_α, and an
optional buffer buf :: table option. The meaning of the other arguments is the same as
for SLet. In the initial state, i is zero and buf is ⊥. Let the function list_opt map ⊥ to
[] and ⟨x⟩ to [x], where ⟨x⟩ is the embedding of x into the option type. A single monitor
step updates the state as follows (see Section 3 for a description of eval's interface):


eval j n tss dbs (SLetPast p m s_α s_β i buf) =
  (let (xs, s_α', i', buf') = eval_lp j m tss dbs p [] s_α i (list_opt buf);
       (ys, s_β') = eval j n tss (dbs[p ↦ xs]) s_β
   in (ys, SLetPast p m s_α' s_β' i' buf'))


The heavy lifting is performed by eval_lp, which is mutually recursive with eval. We
forward relevant variables from eval. The accumulator xs :: table list collects s_α's results.


eval_lp j m tss dbs p xs s_α i buf =
  (let (xs', s_α') = eval j m tss (dbs[p ↦ buf]) s_α; i' = i + length buf
   in (case xs' of [] ⇒ (xs, s_α', i', ⊥)
       | x # _ ⇒ (if i' + 1 ≥ j then (xs @ xs', s_α', i', ⟨x⟩)
                  else eval_lp j m [] (clear_dbs dbs) p (xs @ xs') s_α' i' xs')))
First, eval_lp evaluates s_α with dbs updated at p using the current buffer, which may
be empty. Since i tracks p's progress, we then compute its new value i' by adding the length
of buf to i. The evaluation results in a list xs' of tables and a new state s_α'. We continue to
iterate eval_lp only if two conditions are met: xs' must be nonempty, as otherwise there is
no new data to evaluate s_α' on, and i' + 1 must be less than the current input prefix length.
The latter condition serves as an obvious termination criterion, although it is stricter than
necessary. We could perform an additional iteration in the case that i' + 1 = j. However,
such an iteration would never produce new results because the past operators guarding p
can only be evaluated further if there are new time-stamps. Therefore, we optimize this
case by choosing the stricter condition. If we continue the iteration, we append xs' to the
accumulator xs. Moreover, we clear tss and dbs because all tables from the new input
database have already been processed by the first call to eval. Specifically, the function
clear_dbs dbs updates dbs at all points at which it is defined to an empty list.

We illustrate our algorithm with an example, tracing the computations of eval and
eval_lp. We evaluate LetPast p(x) := q(x) ∨ ● p(x) in p(x), which has the same semantics
as ◆ q(x), on a prefix with two time-points at time-stamps 0 and 3. We omit details
about the subformulas' states, as well as brackets around singleton lists, i.e., [1] is
displayed as 1. Let dbs0 = {q ↦ [{1}, {2}]} be the content of the trace prefix.


eval j:2 n:1 tss:[0,3] dbs:dbs0 s:(SLetPast p 1 α0 β0 0 ⊥)
  eval_lp j:2 m:1 tss:[0,3] dbs:dbs0 p:p xs:[] s_α:α0 i:0 buf:[]
    eval j:2 n:1 tss:[0,3] dbs:(dbs0[p ↦ []]) s:α0 = ([{1}], α1)
    eval_lp j:2 m:1 tss:[] dbs:{q ↦ []} p:p xs:[{1}] s_α:α1 i:0 buf:[{1}]
    | eval j:2 n:1 tss:[] dbs:{p ↦ [{1}], q ↦ []} s:α1 = ([{1,2}], α2)
    | iteration stops because i' = 1 and hence i' + 1 = 2 ≥ j = 2
    = ([{1}, {1,2}], α2, 1, ⟨{1,2}⟩)
  = ([{1}, {1,2}], α2, 1, ⟨{1,2}⟩)
  eval j:2 n:1 tss:[0,3] dbs:(dbs0[p ↦ [{1}, {1,2}]]) s:β0 = ([{1}, {1,2}], β2)
= ([{1}, {1,2}], SLetPast p 1 α2 β2 1 ⟨{1,2}⟩)


Correctness. We extended the correctness proof of eval (Thm. 1) to cover the new state
constructor SLetPast. The added case differs from the one for the non-recursive let in
that eval_lp is used to evaluate the first subformula. The proof also required additional
invariants for the i and buf arguments of SLetPast, as well as a characterization of
LetPast's progress. Recall that progress describes the number of time-points that the
monitor is able to evaluate given a trace prefix of length j. We express the progress of
the let-bound predicate p, which is defined in terms of α, as a least fixpoint:


prog_lp σ P p α j = ⨅ {i. i = prog σ (P[p ↦ i]) α j}

prog σ P (LetPast p := α in β) j = prog σ (P[p ↦ prog_lp σ P p α j]) β j

(We do not update σ in these definitions as progress depends only on the time-stamp
sequence but not on the databases in σ.) The above characterization follows the iteration
in eval_lp: Since prog is pointwise monotone in P and at most j (both facts we prove in the
formalization), the fixpoint can be reached by iteratively computing prog σ (P[p ↦ i]) α j
starting with i = 0. Similarly, eval_lp starts by evaluating α with no data for p and it feeds
the results back into the evaluation until no further results can be obtained. Theorem 2
remains true after adding the above equation to prog.
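
The iterative computation of the least fixpoint can be sketched as follows. The sketch abstracts prog σ (P[p ↦ i]) α j as a function f of i alone; the concrete f used in the example, min(i + 1, j), is an assumed toy progress function for a formula of the shape q(x) ∨ ● p(x) and is not taken from the formalization.

```python
def least_fixpoint(f, j):
    """Iterate a monotone function f with f(i) <= j, starting from 0.

    Monotonicity makes the sequence 0, f(0), f(f(0)), ... non-decreasing,
    and the bound j guarantees that it stabilizes at the least fixpoint.
    """
    i = 0
    while True:
        nxt = f(i)
        assert i <= nxt <= j, "f must be monotone and bounded by j"
        if nxt == i:
            return i
        i = nxt


def toy_progress(j):
    """Assumed progress of alpha given progress i for p (illustrative only)."""
    return lambda i: min(i + 1, j)
```

For a prefix of length j = 2, the iteration visits 0, 1, 2 and stops, which matches the two tables computed for p in the example trace of the evaluation algorithm.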
The state invariant for SLetPast is given by the rule

invar (σ[p ⇒ recp (λR k. satrel (σ[p ⇒ R]) k α)]) j (P[p ↦ i]) m s_α α
invar (σ[p ⇒ recp (λR k. satrel (σ[p ⇒ R]) k α)]) j (P[p ↦ prog_lp σ P p α j]) n s_β β
buf = ⊥ ⟶ i = prog_lp σ P p α j
(∀Z. buf = ⟨Z⟩ ⟶ i + 1 = prog_lp σ P p α j
     ∧ table m (fv α) (recp (λR k. satrel (σ[p ⇒ R]) k α)) Z)
m = nfv α    slp p α ≤ P    {0 ..< m} ⊆ fv α
--------------------------------------------------------------------
invar σ j P n (SLetPast p m s_α s_β i buf) (LetPast p := α in β)


The first two premises use the same updated trace as in the semantics of LetPast
(Section 4.1). The updated progress for p differs slightly between the premise for s_α and
that for s_β. For the latter it is given by prog_lp, as expected. The predicate p's progress
within s_α is equal to the state variable i, which is one less than prog_lp σ P p α j if the
buffer buf is nonempty. This reflects the optimization discussed in Section 4.3. The
predicate table A n R Z is true iff the table Z contains tuples of length n that assign values
to variables A and they are exactly the tuples of this kind satisfying map the v ∈ R.


5 Evaluation 


We have used Isabelle/HOL’s code generator [12] to export a certified implementation of 
VeriMon's core init and step functions and every function those depend on (e.g., opera- 
tions on red-black trees), which amounts to about 10000 lines of OCaml code. VeriMon 
augments this generated code with unverified parsers and pretty-printers. We evaluate 
this implementation to answer the following research questions: (1) How does VeriMon 
perform when monitoring formulas with the recursive let operator?; and (2) How does it 
compare to existing monitors for temporal first-order specifications with recursive rules? 
To answer these questions, we run VeriMon and DejaVu and benchmark some of 
the example formulas introduced in Section 4.2. Instead of SinceLet, we opt for the
simpler OnceLet = LetPast o(u,v) := s(u,v) ∨ ● o(u,v) in filter(x,y) ∧ o(x,y) encoding
the non-metric ◆ operator. We also include Once = filter(x,y) ∧ ◆ s(x,y) for comparison.
The predicate filter(x, y) keeps the output size small. The OnceLet formula uses only 
one recursive predicate instance, whose variable order matches the one in the predicate's 
definition. Other formulas have more than one instance with different variable orders. 
For the PBLet formula, we use an existing random trace generator [17] configured 
to pick parameters from a small integer domain, which increases the probability of 
producing satisfactions. For the other formulas, we generate traces using a similar strategy 
to the one used in DejaVu's benchmarks on the Spawn formula [14]. Namely, edges of a 
tree of spawned processes with a configurable branching factor are linearized into a trace, 
level by level. In the final level all edges converge to a single node for the formulas Trans
and Trans*. We define the edges by Let s*(x,y,w) := e(x,y,w) ∧ ¬◇_[0,1] d(x,y) in the
Trans* formula and revoke one half of the edges on the second level of the branching.
We have executed our experiments on an Intel Core i5-4200U CPU using 8 GB
RAM. Initially, DejaVu crashed on the OnceLet and Spawn formulas. We investigated
the issue and found that its formula's abstract syntax tree was disconnected in these cases.
We assume that this is caused by naming variables in the recursive rules' definitions
differently from those in the rules' usages. After renaming the variables in the let-bound
predicates of these two formulas, the issue was fixed and we restarted the experiments.


Trace    Once              OnceLet           Spawn             Trans             Trans*    PBLet
length   VeriMon  DejaVu   VeriMon  DejaVu   VeriMon  DejaVu   VeriMon  DejaVu   VeriMon   VeriMon

  100    0.0      1.1      0.0      1.1      0.6      1.5      1.3      3.7      5.6       0.0
  200    0.0      1.2      0.0      1.2      3.1      2.1      6.1      8.1      25.9      0.0
  400    0.0      1.3      0.0      1.3      14.0     3.4      28.3     23.6     117.4     0.0
  800    0.0      1.5      0.0      1.4      64.8     8.2      TO       83.4     TO        0.0
 4000    0.2      41.3     0.1      40.5     TO       TO       TO       TO       TO        0.1
 8000    0.4      TO       0.2      TO       TO       TO       TO       TO       TO        0.1
10000    0.5      TO       0.3      TO       TO       TO       TO       TO       TO        0.2


Fig. 3. Execution times of the monitors in seconds (TO = timeout of 120 seconds)


The evaluation results (Figure 3) show that DejaVu's performance is incomparable to 
VeriMon's. VeriMon outperforms DejaVu on the formulas Once and OnceLet and scales 
well on PBLet, which, together with the Trans* formula, we could not express in PFLTL 
with recursion. DejaVu outperforms VeriMon on the Spawn and Trans formulas for which 
VeriMon's time complexity of processing one event is linear in the trace length because 
the number N of valuations satisfying the recursive predicates grows linearly in the trace 
length and the time complexity of updating the recursive predicate is linear in N. We 
conjecture based on some preliminary experiments that VeriMon's performance can be 
significantly improved by optimizing the representation of sets of tuples in two ways: (a) 
using tuples of a fixed length with a fixed assignment of variables to positions in a tuple 
(i.e., no De Bruijn indices); (b) using a collection of indices to optimize the computation 
of joins on various sets of shared columns. Nevertheless, processing one event is
unlikely to become independent of the trace length: Trans encodes the incremental dynamic
transitive closure problem on graphs, with the best known algorithm processing every new
edge in the input in amortized linear time (in the graph's maximum out-degree) [23].


6 Conclusion 


We have presented the extension of a monitor for MFOTL with non-recursive and past- 
recursive let operators. The presence of bounded future temporal operators complicates 
both the semantics and the evaluation algorithms for the new constructs, compared to 
earlier unverified extensions of past-only monitors [14]. Yet, the formal correctness 
proofs that we have carried out ensure the trustworthiness of our development. 

As future work we plan to improve the performance of evaluating expensive joins 
by introducing indices, as used in database management systems. Expressiveness-wise 
we will consider further relaxing the requirements on the recursive let. We can omit the 
past guard if we define a Datalog-style fragment for which the fixpoint is well-defined. 
Beyond relaxing guards, we may want to allow recursion through future operators 
in certain situations. The main challenge is that this would make the progress notion 
data-dependent (unlike currently, where it only depends on the time-stamps). 
Abstract. State-of-the-art solvers for constrained Horn clauses (CHC)
are successfully used to generate reachability facts from symbolic encodings
of programs. In this paper, we present a new application to test-case
generation: if a block of code is provably unreachable, no test case can be
generated for it, which allows the generator to move on to other blocks of code.
Our new approach uses CHC to incrementally construct different program unrollings and
extract test cases from models of satisfiable formulas. At the same time,
a CHC solver keeps track of CHCs that represent unreachable blocks
of code, which makes the unrolling process more efficient. In practice,
this lets our approach terminate early while guaranteeing maximal
coverage. Our implementation, called HORNTINUUM, exhibits promising
performance: it generates high coverage in the majority of cases and
spends less time on average than the state of the art.


1 Introduction 


Branch coverage is a method for testing that aims to maximize the number of 
program branches to be collectively visited by a set of test cases. Branches in the 
code are commonly attributed to the conditional statements or loops. For testing 
a loop-free program, possible test cases for all the branches can be identified by 
symbolic execution, powered by efficient solvers for Boolean Satisfiability (SAT) 
or Satisfiability Modulo Theories (SMT). If a conditional is placed inside or after 
a loop, test-case generation immediately becomes challenging because the cost 
of exploration of every next iteration grows exponentially in the worst case. 

Many verification problems can be reduced to synthesizing interpretations of 
predicates in systems of SMT formulas, also known as constrained Horn clauses 
(CHC), that provide a modular encoding for programs with arbitrary control 
flow. In this paper, we propose to use CHC also for test-case generation. So- 
lutions to CHC, also called inductive invariants, carry reachability information 
and are useful in pruning the search space explored by test-case generators. If an 
invariant shows that a branch can never be taken, then it is guaranteed that no 
test can ever reach the branch, and thus a test-case generator can safely proceed 
to discovery of the next test case. 

We contribute a new approach to test-case generation that aims at maximiz- 
ing branch coverage using inductive invariants. In essence, our approach gradu- 
ally enumerates different unrollings and uses an off-the-shelf SMT solver to get 
values for program variables that represent test cases. Unrollings are constructed
on-the-fly by exploring the CHC encoding of programs. Concurrently, an incremental
CHC solver determines a subset of unreachable CHCs, which allows the
algorithm to explore fewer unrollings in the next iterations. The algorithm terminates
when test cases have been generated for all reachable branches and all the
remaining branches are provably unreachable.

These features distinguish our approach from other white-box test genera- 
tors [1,8,9] that consider reachability information only in the bounded context. 
That is, in the presence of unreachable branches and loops, they may continue
iterating forever, even if all possible test cases have already been generated.
Reliance on invariants lets our tool terminate early while still guaranteeing
maximal possible coverage.

The approach has been implemented on top of the FREQHORN CHC solver [14] 
and the Z3 SMT solver [27]. It enables test-case generation for C programs, con- 
verted to CHCs by the SEAHORN [21] tool. Experiments conducted on a range 
of public benchmarks demonstrate the strengths of our approach compared to
the state of the art: SMT-based incremental test-case generation is able to detect
high-quality solutions in the majority of cases and is on average less expensive.


2 Related Work 


Automated test generation has two main approaches: fuzzing (e.g., [7,20,25, 26, 
29, 31,33,34]) and symbolic/concolic execution (e.g., [3,8, 11,22, 23, 28,32]). The 
former group uses user-given seed inputs and further mutates them based on var- 
ious heuristics (sometimes using the source code as well). The latter group, which 
also includes ours, proceeds by enumerating paths and generating test cases,
often using constraint solvers. Recent algorithms, including FUSEBMC [1] and 
VERIFUZZ [9], follow both approaches: begin with symbolic execution (namely, 
some bounded model checking [10,19]) and then proceed to fuzzing. 

The closest related work [22] suggests accelerating testing using interpolation.
Although it pursues the same goal as ours, i.e., pruning unreachable paths, it does
not generate inductive invariants, which limits the generality of the method.

Earlier attempts to combine static analysis techniques and testing [11] were 
tailored to particular frameworks and languages. With the rise of SMT solvers, 
approaches became more scalable, goal-oriented [3], and at the same time more 
agnostic to programming languages. Recent works, e.g. [33], offer a great flex- 
ibility of applications of static analyzers to test-case generation, e.g., to direct 
fuzzers to specific blocks of code. Following this trend, our approach continues 
bridging the gap between state-of-the-art in automated reasoning and testing. 

While we are not aware of any specific applications of CHC solvers to test- 
case generation, we are largely inspired by the work in model checking, e.g., [6,21] 
that can both discover invariants and find counterexamples (from which a test 
case can be extracted). The main difference is in the application: model checkers 
often focus on a single property/bug, while our goal is to cover the maximal 
number of branches. Furthermore, while many practical approaches including [1, 9]
are based on existing model checkers (that typically use constraint solvers as a
black box), the CHC formulation allows building tools modularly and directly on top
of an SMT solver, which can then be used incrementally for both counterexample
finding and invariant generation.


 1  int x = 0;
 2  int y = nondet();
 3  int z = nondet();
 4  while (1) {
 5    if (x >= 5)
 6      y++;        // needs at least 6 iterations to reach
 7    else
 8      x++;        // x ∈ [0,5] always holds
 9    if (y <= 5)
10      z++;
11    else
12      if (x > y)
13        y++;      // this is unreachable
14      else
15        x = 0;
16    if (z == 0)
17      break;
18  }


Fig. 1: Loopy program with control-flow divergence and unreachable branches.


3 Motivating Example 


Fig. 1 gives a program with a single loop. It has three variables: x is assigned 
zero before the loop, which we cannot change, and the remaining y and z could 
be taken from the user. The loop has four if-then-elses (including one nested), 
and it terminates when the value of z at the end of an iteration equals zero. To 
completely cover all the branches, we need to consider seven cases, in particular: 


line 6: In order to reach the first then-branch, the loop needs to iterate at least
six times and must not reach lines 15 and 17 during the first five iterations.
Thus, line 8 should be visited in each of the first five iterations. A possible scenario
for that would be if initially y = 0 and z = 0.

line 8: The loop always reaches the else-branch of the first conditional because
initially x is zero, and the guard trivially does not hold.

line 10: The guard of the second conditional might hold even at the first iteration
if y is sufficiently small. Since we know that the increment at line
6 does not happen at the first iteration, y might initially be 5 (and z
any). Then the branch is reachable.

line 13: The branch is never reachable because 0 ≤ x ≤ 5 is a loop invariant
(and thus, holds at each iteration) and the path condition x > y ∧ y > 5
is unsatisfiable.

line 15: Because line 13 is unreachable, we know that line 15 is reached always
if the guard of the second conditional does not hold, e.g., when y is
initially greater than 5.

line 17: If initially z = 0 and y is greater than 5, the loop executes the break
statement at the end of its only iteration.

line 19: We have already seen a test case (for line 6) that gives a possible 
condition for the loop to continue iterating. In fact, for any values of 
z greater than zero (and any values of y), the loop does not terminate 
at all. 


All of this makes the program interesting and its analysis challenging.
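The case analysis above can be replayed by direct simulation. The following
Python sketch mirrors the program of Fig. 1 and records which branch lines are
visited. It is only an illustration under two assumptions that are not part of
the original program: Python's unbounded integers stand in for C ints, and a
hypothetical bound max_iters cuts off runs that would not terminate.

```python
def run(y, z, max_iters=20):
    """Replay the program of Fig. 1 and return the set of visited branch lines."""
    x = 0
    visited = set()
    for _ in range(max_iters):      # hypothetical bound replacing while (1)
        if x >= 5:
            y += 1                  # line 6
            visited.add(6)
        else:
            x += 1                  # line 8
            visited.add(8)
        if y <= 5:
            z += 1                  # line 10
            visited.add(10)
        elif x > y:
            y += 1                  # line 13
            visited.add(13)
        else:
            x = 0                   # line 15
            visited.add(15)
        if z == 0:
            visited.add(17)         # line 17
            break
    return visited


print(sorted(run(0, 0)))   # inputs suggested in the text for line 6
print(sorted(run(6, 0)))   # inputs suggested in the text for lines 15 and 17
```

Such a simulation can confirm that a given input reaches a line, but it cannot
show that line 13 is unreachable for all inputs; that requires the invariant
discussed in the text.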


4 Background 


This paper approaches the problem of automated test-case generation by
reduction to the Satisfiability Modulo Theories (SMT) problem. Automated SMT
solvers determine the existence of a satisfying assignment to variables (also
called a model) of a first-order logic formula. Formula $\varphi$ is logically
stronger than formula $\psi$ (denoted $\varphi \implies \psi$) if every model of
$\varphi$ also satisfies $\psi$. The unsatisfiability of formula $\varphi$ is
denoted $\varphi \implies \bot$, and we also write $M \in \varnothing$ to
indicate that no model $M$ of the formula (which is clear from the context)
exists. By writing $\varphi(x)$, we denote a predicate over free variables $x$.

Constrained Horn clauses (CHCs) are used as an intermediate verification
language by both verification frontends and backend SMT solvers. This allows
splitting the effort of designing a verification tool for a new language: while
focusing on encoding programs to CHCs, researchers rely on advances in the CHC
solvers that will solve these CHCs. Thus, by demonstrating our algorithms at
the level of CHCs, we allow for many particular instantiations of them for
various programming languages (that support CHC encoding).


Definition 1. A linear constrained Horn clause (CHC) over a set of
uninterpreted relation symbols R is a first-order-logic formula having the form
of either:

    $\varphi(x_1) \implies inv_1(x_1)$

    $inv_i(x_i) \land \varphi(x_i, x_j) \implies inv_j(x_j)$

    $inv_n(x_n) \land \varphi(x_n) \implies \bot$

where all $inv_i \in R$ are uninterpreted symbols, all $x_i$ are vectors of
variables, and $\varphi$ is a fully interpreted formula called constraint.


These types of implications are called, respectively, a fact, an inductive
clause, and a query. Note that the constraint $\varphi_C$ of each CHC $C$ does
not contain applications of any predicates from R. Further, by body(C), we
denote the premise of C, and by src(C) an application of $inv \in R$ in body(C)
(but if C is a fact, we write $src(C) = \top$). Similarly, by head(C), we denote
the conclusion of C, and by dst(C) we denote an application of $inv \in R$ in
head(C) (and if C is a query, we write $dst(C) = \bot$).

Intuitively, CHCs allow generating program encodings with "holes" that
represent unrollings of unknown lengths. Then, possible instantiations of these
holes can be used in the discovery of meaningful information about the program,
such as loop invariants or function summaries.

Definition 2. Given a set R of uninterpreted predicates and a set S of CHCs
over R, we say that S is satisfiable if there exists an interpretation for every
$inv \in R$ that makes all implications in S valid.


CHCs are useful also when there is a need to access various pieces of program 
encoding and pose reachability queries. In particular, it is straightforward to 
design a Bounded Model Checking (BMC) [5] tool on top of CHCs and use it for 
test-case generation. Specifically, by traversing the graph structure imposed on 
the CHCs, we can access all possible program traces and create the corresponding 
unrollings. 


Definition 3. Given a system S of CHCs over R, an unrolling of S of length k
is a conjunction

    $\pi_{\langle C_0, \ldots, C_k \rangle} = \bigwedge_{0 \le i \le k} \varphi_{C_i}(x_i, x_{i+1})$,

such that 1) $C_0$ is a fact, 2) each $C_i \in S$, 3) for each pair $C_i$ and
$C_{i+1}$, $rel(dst(C_i)) = rel(src(C_{i+1}))$, and variables of each $x_i$ are
shared only between $\varphi_{C_{i-1}}(x_{i-1}, x_i)$ and
$\varphi_{C_i}(x_i, x_{i+1})$.


For bug finding, it is essential to enumerate various unrollings and check their
satisfiability. Once a satisfiable formula $\pi_{\langle C_0, \ldots, C_k \rangle}$
is found for some query $C_k$, the bug is found (and its counterexample can be
obtained from the model), and thus no interpretation for the predicates in R
that satisfies the system exists.


Lemma 1. Given a system of CHCs S, let $\pi_{\langle C_0, \ldots, C_k \rangle}$
be one of its unrollings, such that $C_0$ is a fact and $C_k$ is the query. If
$\pi_{\langle C_0, \ldots, C_k \rangle}$ is satisfiable, then S is unsatisfiable.


In the next section, we expand on the notions of CHCs and unrollings, give 
examples, and present the application to test-case generation. 


5 Test-case Generation for Branch Coverage 


The concept of constrained Horn clauses is convenient for formulating the prob- 
lem of constructing a maximal branch coverage (MBC) of a given program. At 
the highest level, the problem of finding MBC is concerned with finding a set of 
program executions that visit all reachable program branches. Given the CHC 
encoding of the program, this can be reduced to a problem of finding a set of 
satisfiable unrollings that involve the maximal number of CHCs. However, to
guarantee maximality, this needs a special property of the CHC encoding: the
constraint $\varphi$ in each CHC should represent a straight-line code sequence
with no branches (a.k.a. basic block). Technically, this can be formulated as
the requirement for each CHC to have a conjunction of literals (a.k.a. cube),
i.e., have no disjunctions, in its body.


Example 1. Fig. 2 gives a CHC encoding of the program in Fig. 1. There are eight
CHCs over four uninterpreted predicates A, B, C, and D. The program entry
is encoded in the first CHC (i.e., the only fact, with the dst-predicate A), and
its exit in the last CHC (i.e., with the dst-predicate D). All other CHCs encode
the loop with a total of six symbolic paths, each following A → B → C → A (as
can be seen from the graphic representation) but involving different CHCs. Each
CHC has no disjunctions in the body: it has the conjunction of the (negation
of the) guard and the encoding of the program instructions following the
corresponding branch until either the next conditional or the join occurs. Note
that there are no queries in this system since there are no assertions in the
program.


1  $x = 0 \implies A(x, y, z)$
2  $A(x, y, z) \land x \ge 5 \land x' = x \land y' = y + 1 \land z' = z \implies B(x', y', z')$
3  $A(x, y, z) \land x < 5 \land x' = x + 1 \land y' = y \land z' = z \implies B(x', y', z')$
4  $B(x, y, z) \land y \le 5 \land x' = x \land y' = y \land z' = z + 1 \implies C(x', y', z')$
5  $B(x, y, z) \land y > 5 \land x > y \land x' = x \land y' = y + 1 \land z' = z \implies C(x', y', z')$
6  $B(x, y, z) \land y > 5 \land x \le y \land x' = 0 \land y' = y \land z' = z \implies C(x', y', z')$
7  $C(x, y, z) \land z \ne 0 \implies A(x, y, z)$
8  $C(x, y, z) \land z = 0 \implies D(x, y, z)$

[Graph, edges labeled by CHC numbers: ⊤ →(1) A; A →(2),(3) B; B →(4),(5),(6) C;
C →(7) A; C →(8) D]


Fig. 2: CHCs of the motivating example (left) and src/dst-dependency graph (right).


To formulate the MBC problem at the level of CHCs, it is convenient to 
introduce the concept of a src/dst-dependency graph for a system of CHCs. 


Definition 4. Given a system S of CHCs over a set of uninterpreted predicate 
symbols R, its src/dst-dependency graph (R, E) is a directed graph with edges 
labeled by CHCs from S: 


    $E = \{(rel(src(C)), C, rel(dst(C))) \mid C \in S\}$.


Because we are bound in this paper to use only disjunction-free CHCs, the
points of control-flow divergence in a program encoded in these CHCs are
captured by vertices in the src/dst-dependency graph that have more than one
outgoing edge¹. To generate a test case visiting some block of code encoded
in a CHC $C_k$, it is enough to find an unrolling
$\pi_{\langle C_0, \ldots, C_k \rangle}$ and show that this unrolling is
satisfiable. In this case, the CHC is called reachable: the satisfying
assignment naturally corresponds to a program trace beginning at the program
entry point and reaching the code in that branch. Furthermore, if the execution
depends on some input values, these values can also be extracted from the
satisfying assignment.


1 Thus, in this case, the src/dst-dependency graph can be seen as a control-flow
graph (CFG) of the encoded program. In practice, many verification tools that
are based on CHCs do not generate CHCs in such form but apply some
generalization and compression to the CFG during preprocessing. This results in
CHCs with disjunctive bodies that are unsuitable for our approach. In these
cases, we explicitly convert the body of each CHC to a disjunctive normal form
(DNF) and clone the CHC for each cube in the DNF. The CHC system after this
transformation is still a correct encoding of the original program, and its
src/dst-dependency graph is suitable for our approach, but it may not exactly
match the CFG of the original program.
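The DNF-based cloning of CHCs mentioned in the footnote on disjunctive bodies
can be sketched as follows. This is a minimal Python illustration, not the
tool's implementation: formulas are assumed to be nested tuples ("and", ...) and
("or", ...) over literal strings with negations already pushed to the literals,
and the names dnf and split_chc are hypothetical.

```python
from itertools import product


def dnf(formula):
    """Return the DNF of a formula as a list of cubes (lists of literal strings)."""
    if isinstance(formula, str):              # a literal is a one-element cube
        return [[formula]]
    op, *args = formula
    if op == "or":                            # union of the cubes of all disjuncts
        return [cube for arg in args for cube in dnf(arg)]
    if op == "and":                           # distribute conjunction over the cubes
        return [
            [lit for cube in combo for lit in cube]
            for combo in product(*(dnf(arg) for arg in args))
        ]
    raise ValueError(f"unsupported connective: {op}")


def split_chc(src, body, dst):
    """Clone a CHC once per cube, so every resulting body is disjunction-free."""
    return [(src, cube, dst) for cube in dnf(body)]


# A compressed clause from B to C whose body merges two branches of Fig. 1.
merged = ("or",
          ("and", "y <= 5", "z' = z + 1"),
          ("and", "y > 5", "x <= y", "x' = 0"))
for clause in split_chc("B", merged, "C"):
    print(clause)
```

The number of cubes can grow exponentially with the nesting of conjunctions over
disjunctions, which is one reason why the resulting graph may differ from the
CFG of the original program.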


Example 2. According to Fig. 2, the first point of control-flow divergence is
predicate A. To show that CHC 3 is reachable, we create the following unrolling
from the bodies of CHCs 1 and 3:

    $x = 0 \land x < 5 \land x' = x + 1 \land y' = y \land z' = z$.

This formula is satisfiable, and there exists a model
$M = \{x \mapsto 0, y \mapsto 0, z \mapsto 0, \ldots\}$, thus giving us two
values for input variables y and z (both zeroes).
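The satisfiability check of an unrolling can be mimicked without an SMT solver
on this small example. The sketch below is only a stand-in for the solver call:
it encodes CHCs 2 to 8 of Fig. 2 as guard/update pairs and searches a bounded
box of input values. The names CHCS and find_model and the bound are
assumptions of the sketch, and a result of None only means that no model exists
within the box.

```python
from itertools import product

# CHCs 2-8 of Fig. 2 as (src, guard, update, dst) over a state (x, y, z).
CHCS = {
    2: ("A", lambda x, y, z: x >= 5,           lambda x, y, z: (x, y + 1, z), "B"),
    3: ("A", lambda x, y, z: x < 5,            lambda x, y, z: (x + 1, y, z), "B"),
    4: ("B", lambda x, y, z: y <= 5,           lambda x, y, z: (x, y, z + 1), "C"),
    5: ("B", lambda x, y, z: y > 5 and x > y,  lambda x, y, z: (x, y + 1, z), "C"),
    6: ("B", lambda x, y, z: y > 5 and x <= y, lambda x, y, z: (0, y, z),     "C"),
    7: ("C", lambda x, y, z: z != 0,           lambda x, y, z: (x, y, z),     "A"),
    8: ("C", lambda x, y, z: z == 0,           lambda x, y, z: (x, y, z),     "D"),
}


def find_model(trace, bound=8):
    """Search inputs (y, z) in [-bound, bound]^2 satisfying the unrolling of trace."""
    assert trace and trace[0] == 1, "an unrolling starts with the fact (CHC 1)"
    candidates = sorted(range(-bound, bound + 1), key=abs)   # small values first
    for y0, z0 in product(candidates, repeat=2):
        state, loc, ok = (0, y0, z0), "A", True   # CHC 1: x = 0, dst-predicate A
        for c in trace[1:]:
            src, guard, update, dst = CHCS[c]
            if src != loc or not guard(*state):   # relations must chain, guard must hold
                ok = False
                break
            state, loc = update(*state), dst
        if ok:
            return {"y": y0, "z": z0}
    return None


print(find_model([1, 3]))                       # the unrolling of Example 2
print(find_model([1] + [3, 4, 7] * 5 + [2]))    # six iterations, reaching CHC 2
```

Because every CHC of this example is deterministic once y and z are fixed,
enumerating inputs is enough here; a real implementation delegates the search to
the SMT solver.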


It can also be seen that some CHCs cannot be visited by any trace. To find 
them, we can pose additional safety verification queries and aim at generating 
an appropriate invariant. 


Lemma 2. Let S be a system of CHCs over some R, and let C be some CHC
from S. If the extended CHC system
$S \cup \{src(C) \land \varphi_C \implies \bot\}$ is satisfiable, then C is
unreachable.


The proof of the lemma follows directly from Lemma 1.


Example 3. In the CHC system in Fig. 2, CHC 5 is never reachable. We introduce
a new query CHC Q as follows:

    $B(x, y, z) \land y > 5 \land x > y \land x' = x \land y' = y + 1 \land z' = z \implies \bot$

The system $S \cup \{Q\}$ is satisfiable, with the following interpretation M:

    $M(A) = M(B) = M(C) = \lambda x, y, z \,.\, x \le 5$

Because $x \le 5 \land y > 5 \land x > y$ is unsatisfiable, CHC 5 is unreachable.
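The two facts used in Example 3, namely that x ≤ 5 is preserved by every CHC
and that it contradicts the guard of CHC 5, can be sanity-checked by
enumeration. The Python sketch below does so only on a bounded box of states,
so it is a plausibility check under that assumption and not a proof; the names
TRANSITIONS, preserved, and blocks_chc5 are hypothetical.

```python
from itertools import product

# Guards and updates of CHCs 2-8 from Fig. 2 over a state (x, y, z).
TRANSITIONS = [
    (lambda x, y, z: x >= 5,           lambda x, y, z: (x, y + 1, z)),   # CHC 2
    (lambda x, y, z: x < 5,            lambda x, y, z: (x + 1, y, z)),   # CHC 3
    (lambda x, y, z: y <= 5,           lambda x, y, z: (x, y, z + 1)),   # CHC 4
    (lambda x, y, z: y > 5 and x > y,  lambda x, y, z: (x, y + 1, z)),   # CHC 5
    (lambda x, y, z: y > 5 and x <= y, lambda x, y, z: (0, y, z)),       # CHC 6
    (lambda x, y, z: z != 0,           lambda x, y, z: (x, y, z)),       # CHC 7
    (lambda x, y, z: z == 0,           lambda x, y, z: (x, y, z)),       # CHC 8
]


def preserved(inv, bound=8):
    """Check on [-bound, bound]^3 that inv holds initially and after every CHC."""
    dom = range(-bound, bound + 1)
    if not all(inv(0, y, z) for y, z in product(dom, repeat=2)):   # fact: x = 0
        return False
    return all(
        inv(*update(*s))
        for s in product(dom, repeat=3) if inv(*s)
        for guard, update in TRANSITIONS if guard(*s)
    )


def blocks_chc5(inv, bound=8):
    """Check on the same box that inv and the guard of CHC 5 never hold together."""
    dom = range(-bound, bound + 1)
    return not any(inv(x, y, z) and y > 5 and x > y
                   for x, y, z in product(dom, repeat=3))


inv = lambda x, y, z: x <= 5
print(preserved(inv), blocks_chc5(inv))
```

For simplicity the same candidate is used at every predicate, matching the
interpretation M(A) = M(B) = M(C) of the example.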


These ingredients let us state the MBC problem formally.


Definition 5 (MBC). Given a system S of CHCs over some R, the problem
of maximizing branch coverage of S is concerned with 1) determining a subset
$S_u \subseteq S$ of CHCs which are provably unreachable (i.e., Lemma 2
applies), and 2) finding satisfiable unrollings for all CHCs from
$S \setminus S_u$.


The practical significance of the MBC problem is that it allows test-generation
tools that are based on bounded model checking, e.g., [1], to terminate
earlier. The invariants discovered while iteratively applying Lemma 2 can serve
as annotations of various nodes of the program CFG, which further enables
pruning the search space of the test cases. In particular, for our running
example in Fig. 1, line 13 is provably unreachable, so it makes no sense to
search for a test case for it.

Furthermore, with an invariant that blocks a branch at hand, the tools can
explore fewer unrollings leading to other branches in the next iterations of
the loop. Specifically, to reach line 6, the first five iterations of the loop
provably skip line 13, so instead of $(2 \times 3)^5 = 7776$ unrollings, the
tool needs to explore only $(2 \times 2)^5 = 1024$ unrollings.
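The two counts can be read off the CHCs of Fig. 2, assuming that an iteration
which continues the loop always ends with CHC 7: each iteration then picks one
of the two CHCs leaving A (2 or 3) and one of the three CHCs leaving B (4, 5,
or 6). Over five iterations, with and without CHC 5, this gives

```latex
\[
\underbrace{(2 \cdot 3)^5}_{\text{CHC 5 allowed}} = 6^5 = 7776
\qquad\text{versus}\qquad
\underbrace{(2 \cdot 2)^5}_{\text{CHC 5 excluded}} = 4^5 = 1024 ,
\]
```

that is, excluding the unreachable CHC removes roughly 87% of the candidate
unrollings of this length.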


6 Solving the MBC problem


In this section, we introduce our novel approach to constructing the maximal 
branch coverage using a system of disjunction-free CHCs. We begin with outlin- 
ing our key ideas that can be implemented on top of existing test-case generators 
and invariant generators, and then proceed to describing our efficient implemen- 
tation. 


6.1 Key Insights 


The approach has a simple high-level structure. Because the number of CHCs
in a program encoding is always finite, we can pose a safety verification query
for each of them.

Existing CHC solvers are equipped with the functionality to generate both
counterexamples and safety invariants. However, a recent evaluation [17] shows
that bounded-model-checking implementations often outperform general-purpose
solvers on unsatisfiable CHC instances (likely because they do not invest
effort in generating invariants). This suggests that, for performance reasons,
it makes sense to alternate between separate runs of a counterexample generator
(via enumerating the unrollings) and an invariant generator. This allows for two
main benefits, outlined in the next two paragraphs.

A counterexample generator, in the MBC setting, should handle a large num- 
ber of unrollings. Many of the unrollings are unsatisfiable since some sequentially 
aligned branches might be incompatible, and some other branches might be wait- 
ing for a certain loop iteration. It is thus essential to share the information about 
conflicting paths’ segments (e.g., unsatisfiable prefixes, as in our implementation) 
to accelerate the search. Dually, satisfiable unrollings can often be extended to 
unrollings for other reachable CHCs, and this information can be exploited in 
the enumerative search for the remaining branches. 

An invariant generator, invoked multiple times throughout the process, deals 
with many largely similar safety verification instances (since all CHCs are the 
same, and only queries are different). Thus, a lot of information can be reused 
between verification runs, opening the opportunities for incremental verifica- 
tion [13]. Formally, all invariants that are discovered while proving the unreach- 
ability of a CHC will remain valid after switching to another CHC. Even more, 
the solvers that target conjunctive invariant generation, e.g., [15,24] can output 
“partial” invariants (i.e., some lemmas) even for unsatisfiable CHC instances, 
which then can be reused/completed in the next runs of the solver. 

These observations let us conclude that although using off-the-shelf tools for
bounded model checking and invariant generation is possible, an MBC solver will
likely exhibit better performance through the design of new algorithms that
incorporate the aforementioned insights.


6.2 General Driver 


The pseudocode of our approach is given in Alg. 1. The algorithm begins with
identifying a subset cur of CHCs that need to be considered in its iterations.

Algorithm 1: CHC-based test-case generator.
Input: S: a CHC system over R
Output: T: a set of satisfying assignments to variables in S
Data: invs: mapping from R to invariants, G = (R, E): an edge-labeled
      graph, cur ⊆ S: a subset of CHCs to consider, length: counter
      representing the length of the current unrollings, traces: a (global) set
      of traces to consider

 1  (R, E) ← src/dst-dependency graph of S;
 2  cur ← {C | (u, C, v1) ∈ E and ∃(u, _, v2) ∈ E where v1 ≠ v2};
 3  if cur = ∅ then cur ← {C | src(C) = ⊤};
 4  length ← 1;
 5  while cur ≠ ∅ do
 6    for chc ∈ cur do
 7      (res, invs, cex) ← SOLVECHCS(S ∪ {body(chc) ⟹ ⊥}, invs);
 8      if res = SAT then
 9        cur ← cur \ {chc};
10        E ← {(u, C, v) | (u, C, v) ∈ E and C ≠ chc};
11      else if res = UNSAT then
12        cur ← cur \ {chc};
13        T ← T ∪ {cex};
14      else
15        traces ← ∅;
16        GETTRACES(E, ⊤, chc, length, nil, prefixes, traces);
17        for t ∈ traces do
18          (res, M) ← CHECKSAT(UNROLL(S, t));
19          if res = SAT then
20            T ← T ∪ {M};
21            cur ← cur \ {chc};
22            break;
23          else
24            prefixes ← prefixes ∪ {t};
25    length ← length + 1;
We say that a CHC C opens a branch if the outdegree of rel(src(C)) in the
src/dst-dependency graph is greater than one (line 2). Thus, to generate a test
case visiting a branch, it is enough to find an unrolling
$\pi_{\langle C_0, \ldots, C_k \rangle}$, where $C_k$ opens that branch, and
show that this unrolling is satisfiable. If, however, there are no branches in
the given program at all, then cur gets all facts of the CHC system (line 3),
and the remaining coverage generation is straightforward.
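The selection of branch-opening CHCs can be sketched on the graph of Fig. 2.
The Python fragment below follows the prose reading (outdegree of the source
relation greater than one, counting every labeled edge); the edge list EDGES,
the string "T" standing for ⊤, and the function name branch_openers are
assumptions of this illustration.

```python
from collections import defaultdict

# src/dst-dependency graph of Fig. 2 as (source, CHC number, destination) triples.
EDGES = [
    ("T", 1, "A"),
    ("A", 2, "B"), ("A", 3, "B"),
    ("B", 4, "C"), ("B", 5, "C"), ("B", 6, "C"),
    ("C", 7, "A"), ("C", 8, "D"),
]


def branch_openers(edges):
    """Return the CHCs whose source relation has more than one outgoing edge."""
    outgoing = defaultdict(list)
    for src, chc, _dst in edges:
        outgoing[src].append(chc)
    cur = {chc for src, chc, _dst in edges if len(outgoing[src]) > 1}
    if not cur:                       # no branching at all: fall back to the facts
        cur = {chc for src, chc, _dst in edges if src == "T"}
    return cur


print(sorted(branch_openers(EDGES)))
```

For the running example every CHC except the fact opens a branch, so the main
loop has seven reachability questions to settle.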


The rest of the algorithm is organized as a big loop that decides whether each
CHC from cur is (un)reachable and terminates when cur is empty. At each
iteration of the loop, all CHCs from cur are enumerated, and the algorithm
seeks to apply Lemma 2, i.e., it extends S with one query and solves the
resulting CHC system (line 7). The algorithm can use any CHC solving algorithm
that decides the satisfiability of CHCs and returns inductive invariants
(line 8) or (optionally²) a counterexample (line 11). In both cases, the CHC is
excluded from cur. Additionally, if the system is satisfiable, this CHC cannot
be used in any unrolling, and it is also excluded from the auxiliary graph
(line 10, to prune the search space of the remaining test cases). If a
counterexample is returned, the branch is reachable, and the test case is
extracted from this counterexample (line 13).


Algorithm 2: GETTRACES: trace enumerator.
Input: E ∈ 2^(R×S×R): labeled edges, u ∈ R, chc ∈ S, length: length of trace,
       t: trace prefix, prefixes: to avoid
Output: traces: global set of traces of the given length beginning with
        relation u and ending with chc

1  if ∃p ∈ prefixes . ∀i . i ∈ [0, |p|) ⟹ p_i = t_i then
2    return;
3  if length = 1 then
4    if (u, chc, _) ∈ E then
5      traces ← traces ∪ {t@chc};
6  else
7    for (u, C, v) ∈ E do
8      GETTRACES(E, v, chc, length - 1, t@C, prefixes, traces);


It is also possible (and in practice, very likely) that the CHC solver returns
UNKNOWN (because the problem is undecidable, and invariant generators are
often limited to either a fixed shape of invariants or a certain timeout). In
this case (lines 16-22), the algorithm proceeds with an explicit enumeration of
unrollings of a predetermined length (line 16). Each trace
$t = \langle t_0, t_1, \ldots, t_{length-1} \rangle$ has an associated unrolling
$\pi_{\langle t_0, \ldots, t_{length-1} \rangle}$, which is checked for
satisfiability (line 18) with an off-the-shelf SMT solver. If satisfiable
(line 19), the branch opened by the current CHC is reachable, the test case is
generated from the model, and the CHC is excluded from cur. If unsatisfiable
(line 23), the algorithm registers this t as an unsatisfiable prefix to be
avoided in the trace generation in the next iterations (see Alg. 2).


Theorem 1. When Alg. 1 terminates, the resulting set T contains all the vari- 
able assignments needed for maximal coverage. 


In the next two paragraphs we discuss two important design choices that do 
not affect the correctness of our implementation, but optimize it. 


2 In fact, the counterexample detection in some CHC solvers, e.g., [24],
proceeds in a similar fashion as described in our algorithm, but if invoked
multiple times throughout the algorithm, it is likely that the CHC solver will
perform many redundant actions. We thus do not use this functionality in our
experiments (and our Alg. 3), but leave it in the pseudocode for the sake of
completeness of presentation.

Algorithm 3: SOLVECHCS.
Input: S: a CHC system over R, invs: mapping from R to invariants
Output: res ∈ {SAT, UNSAT}, invs: updated mapping, [cex: counterexample]

1  S' ← ∅;
2  for chc ∈ S do
3    S' ← S' ∪ {src(chc) ∧ (body(chc))[R ↦ invs] ⟹ dst(chc)};
4  if λx.⊤ is a solution for S' then
5    return invs;
6  return (res, invs, _) ← FREQHORN(S');


6.3 Incremental Trace Enumeration 


Our algorithm allows for sharing the information obtained during its iterations 
using two global data structures: the set of unsatisfiable prefixes discovered dur- 
ing the trace enumeration and the graph-structure (R, E) representing poten- 
tially reachable CHCs (line 10 of Alg. 1). Intuitively, the latter is constructed by 
an iterative removal of edges from the src/dst-dependency graph, thus allowing 
for a more focused search of suitable traces. Both data structures are used in 
Alg. 2 that is called at the next algorithm iteration. 

Conceptually, Alg. 2 is a dynamic-programming implementation of a path
finder in an arbitrary directed graph. Given the length of a path and its
starting and ending points, the algorithm recursively visits the graph edges
and stores them in vectors³. In our setting, the algorithm is optimized in two
ways. First, at line 1, it skips paths with unsatisfiable prefixes (because the
corresponding unrollings will be unsatisfiable too). Second, at lines 4 and 7,
it never visits the unreachable CHCs that have been excluded from the graph
previously.


Example 4. Recall our running example for the program encoded as CHCs in Fig. 2.
For length = 2 and CHC (2), Alg. 2 constructs a single trace ⟨(1), (2)⟩ that
corresponds to an unsatisfiable unrolling, which is found by Alg. 1 and thus
added to prefixes. Consequently, for length = 3, traces ⟨(1), (2), (4)⟩ and
⟨(1), (2), (6)⟩ are not generated. Furthermore, because (5) is never reachable,
the edge (B, (5), C) is excluded from E permanently.
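The behavior described in Example 4 can be reproduced with a direct
transcription of Alg. 2. The Python sketch below is an illustration only: traces
are tuples of CHC numbers, "T" stands for ⊤, and the edge list EDGES of Fig. 2
is an assumption of the sketch rather than the tool's data structure.

```python
EDGES = [
    ("T", 1, "A"),
    ("A", 2, "B"), ("A", 3, "B"),
    ("B", 4, "C"), ("B", 5, "C"), ("B", 6, "C"),
    ("C", 7, "A"), ("C", 8, "D"),
]


def get_traces(edges, u, chc, length, t, prefixes, traces):
    """Collect traces of the given length that start at relation u and end with chc."""
    # Line 1 of Alg. 2: give up if a known unsatisfiable prefix starts the trace t.
    if any(len(p) <= len(t) and t[:len(p)] == p for p in prefixes):
        return
    if length == 1:
        # Lines 3-5: the last step must be the target CHC leaving relation u.
        if any(src == u and c == chc for src, c, _dst in edges):
            traces.append(t + (chc,))
    else:
        # Lines 7-8: extend the trace along every edge leaving u.
        for src, c, dst in edges:
            if src == u:
                get_traces(edges, dst, chc, length - 1, t + (c,), prefixes, traces)


found = []
get_traces(EDGES, "T", 2, 2, (), set(), found)
print(found)                               # the single trace for CHC 2, length 2

prefixes = {(1, 2)}                        # registered after the unrolling was unsatisfiable
found = []
get_traces(EDGES, "T", 4, 3, (), prefixes, found)
print(found)                               # traces for CHC 4 that avoid the bad prefix
```

Removing the edge labeled 5 from the list has the same pruning effect for every
later call, which corresponds to line 10 of Alg. 1.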


6.4 Incremental Invariant Discovery 


Alg. 3 gives the main idea of our CHC solver, which relies on the FREQHORN [15]
algorithm to synthesize invariants (any other CHC solver could be used as well).
In addition, however, it recycles the invariants invs generated in all previous
runs. Specifically, it substitutes interpretations for each r ∈ R in the body of
each CHC (line 3). Because each such formula represents an over-approximation
of the set of reachable states at a particular program location, this
substitution is sound.


3 We use the notation t@C to represent the "push back" operation over a vector t
and an element C.

If it appears that after the substitution, all the remaining invariants are
simply true-formulas (line 4), then invs is already a solution, and the CHC
solver is not needed. On the other hand, invariants could be generated by an
external tool.

While the pseudocode of FREQHORN is omitted from Alg. 3 for simplicity,
we list its distinguishing features here. The approach is driven by Syntax
Guided Synthesis (SyGuS) [2], and it supports (possibly non-linear) arithmetic
and arrays [16]. It automatically constructs formal grammars G(inv) for each
inv ∈ R based on either source code [14] or program behaviors [15,30].
Importantly, these grammars are conjunction-free, and they allow for only a
finite number of candidates. FREQHORN iteratively attempts to apply production
rules of each G(inv) to sample a candidate and checks it with an SMT solver (a
successfully checked candidate is then called a lemma). The process continues
until either a conjunction of lemmas is sufficient or the search space is
exhausted. To make the process less dependent on the order in which candidates
are considered, FREQHORN uses batching [12] (i.e., it checks several candidates
at the same time) and effectively filters them using the well-known HOUDINI
algorithm [18].
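The batching-and-filtering idea can be illustrated with a Houdini-style loop
over the CHCs of Fig. 2. The sketch below is not FREQHORN's code: it replaces
the SMT checks by enumeration over a bounded box of states (an assumption that
makes it a plausibility check only), uses one candidate set for all predicates,
and the names TRANSITIONS and houdini are hypothetical.

```python
from itertools import product

# Guards and updates of CHCs 2-8 from Fig. 2 over a state (x, y, z).
TRANSITIONS = [
    (lambda x, y, z: x >= 5,           lambda x, y, z: (x, y + 1, z)),
    (lambda x, y, z: x < 5,            lambda x, y, z: (x + 1, y, z)),
    (lambda x, y, z: y <= 5,           lambda x, y, z: (x, y, z + 1)),
    (lambda x, y, z: y > 5 and x > y,  lambda x, y, z: (x, y + 1, z)),
    (lambda x, y, z: y > 5 and x <= y, lambda x, y, z: (0, y, z)),
    (lambda x, y, z: z != 0,           lambda x, y, z: (x, y, z)),
    (lambda x, y, z: z == 0,           lambda x, y, z: (x, y, z)),
]


def houdini(candidates, bound=8):
    """Drop candidates until the conjunction of the survivors is inductive on the box."""
    dom = range(-bound, bound + 1)
    states = list(product(dom, repeat=3))
    lemmas = dict(candidates)
    changed = True
    while changed:
        changed = False
        for name, pred in list(lemmas.items()):
            # Initiation: the fact x = 0 must imply the candidate.
            ok = all(pred(0, y, z) for y, z in product(dom, repeat=2))
            # Consecution, relative to the conjunction of the surviving candidates.
            if ok:
                for s in states:
                    if not all(p(*s) for p in lemmas.values()):
                        continue
                    if any(g(*s) and not pred(*u(*s)) for g, u in TRANSITIONS):
                        ok = False
                        break
            if not ok:
                del lemmas[name]
                changed = True
    return set(lemmas)


batch = {
    "x >= 0": lambda x, y, z: x >= 0,
    "x <= 5": lambda x, y, z: x <= 5,
    "x <= 4": lambda x, y, z: x <= 4,
    "y >= 0": lambda x, y, z: y >= 0,
}
print(sorted(houdini(batch)))
```

The surviving candidates play the role of lemmas: they can be kept for later
runs even when they are not yet strong enough to block a particular query.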

These features make FREQHORN especially useful for the application to test- 
case generation. Behaviors and counterexamples can be obtained from traces 
as outlined in Sect. 6.3. Each new counterexample potentially contributes to a 
new data candidate to be considered in the next invocations of the algorithm. 
Then, following our incremental schema, new candidates are used in conjunction 
with previously generated invariants, and either added to invs, or dropped. Note 
that even if FREQHORN returns UNKNOWN indicating that it is unable to find 
a strong enough invariant, it almost always finds some lemmas that might be 
useful for the next iterations of our main algorithm. 


7 Evaluation 


We have implemented the approach in a tool called HORNTINUUM⁴. The backend
of HORNTINUUM is developed on top of FREQHORN [14] and uses it for CHC
solving. All the symbolic reasoning in our backend is performed by the Z3 [27]
SMT solver, v4.8.10. For encoding C benchmarks to CHCs in our frontend, we
use the SEAHORN [21] verification framework, v10.0.0-rc0, via its Docker image⁵.


Implementation details. The success of our approach largely depends on
the preprocessing performed by SEAHORN while producing the CHC encoding.
Since our algorithm works on disjunction-free CHCs (recall Sect. 6.1), we
configure SEAHORN to perform a small-step encoding, i.e., introducing a CHC per
each basic block (via the --step-small option). However, the encoder, based
on LLVM, additionally performs several LLVM transformations⁶ and auxiliary
SEAHORN passes that may introduce disjunctions to CHCs. Since this recipe
is not configurable in SEAHORN yet, we additionally get rid of disjunctions by
performing a DNF-ization over the CHCs received from SEAHORN.


4 The source code of the tool is publicly available at
https://github.com/izlatkin/HornLauncher with the CHC-based backend at
https://github.com/izlatkin/aeval/tree/tg.

5 https://hub.docker.com/r/seahorn/seahorn-llvm10.

We also had to overcome a relatively minor engineering obstacle to allow
recognizing multiple nondet() function calls (see an example in Fig. 1). The
CHC representation is in some sense declarative, i.e., it is not always possible
to detect the order of function calls from formulas that represent program
unrollings. Thus, we rename each invocation of nondet() in each input C file,
e.g., to nondet_i(), which lets HORNTINUUM associate each function invocation
with a sequence of static-single-assignment (SSA) variables that encode
(possibly many, if nondet_i() is called in a loop) outputs of nondet_i()
occurring in an unrolling. Further, it gives a sequence of concrete values
obtained for each of the SSA variables by the SMT solver. In a generated test
case, sequences of SSA values of each nondet_i are stored in a separate array
(to capture values in each loop iteration) and accessed by an automatically
generated body of the corresponding unique nondet_i() function.

In a sense, the final output of our tool is a set of context-specific
implementations of function nondet() written in different header files. The
initial C file should include a header from this set and be compiled and run in
order to reproduce the detected test case.⁷
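To make the idea of replaying recorded values concrete, the following Python
sketch renders such a header from per-call value sequences. The text does not
specify the layout of the generated headers, so everything here is an
assumption made for illustration: the function name emit_header, the naming of
the C arrays and counters, and the fallback to 0 once the recorded values are
exhausted.

```python
def emit_header(values_by_call):
    """Render C definitions of nondet_<i>() that replay the given value sequences."""
    parts = []
    for i, values in sorted(values_by_call.items()):
        if not values:
            raise ValueError(f"no recorded values for nondet_{i}")
        n = len(values)
        init = ", ".join(str(v) for v in values)
        parts.append(
            f"static int nondet_{i}_vals[{n}] = {{{init}}};\n"
            f"static int nondet_{i}_pos = 0;\n"
            f"int nondet_{i}(void) {{\n"
            f"  /* replay recorded values; return 0 once they are exhausted */\n"
            f"  return nondet_{i}_pos < {n} ? nondet_{i}_vals[nondet_{i}_pos++] : 0;\n"
            f"}}\n"
        )
    return "\n".join(parts)


# Test case for line 6 of Fig. 1: the calls for y and z both return 0.
print(emit_header({0: [0], 1: [0]}))
```

A call placed inside a loop would simply get a longer array, one entry per
iteration of the unrolling from which the model was extracted.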


Experimental setup. To evaluate HORNTINUUM, we configured the GCOV
tool, v9.3.0, a code coverage analysis and profiling tool that tracks all
statements visited in a single run of the program. Running GCOV for each of our
generated test cases and merging the statistics gives the final coverage: we
ultimately aim to maximize the amount of code visited by at least one test
case.⁸

We compared HORNTINUUM with the state-of-the-art tools FUSEBMC [1],
VERIFUZZ [9], and KLEE [8]⁹, which exhibited decent performance in TestComp
2021. Our experiments were run on a "Dell OptiPlex 7090 Tower" desktop computer
with a 2.5 GHz Intel Core i7 8-Core (11th Gen), 16GB 3200 MHz DDR4 RAM,
and Ubuntu 20.04.1 LTS installed on it.

For the experimentation we considered 316 benchmarks from TestComp (from
loop-* tracks, excluding the programs with floating points that our CHC solver
does not support yet). The largest considered benchmark has >5K LoC. The
performance of all three competitors (using the timeout of 15 minutes) on our
machine was consistent with the one exhibited in TestComp 2021: VERIFUZZ
slightly outperforms FUSEBMC, and both outperform KLEE.


6 One transformation, for instance, removes redundant branches from the code,
e.g., replaces if (nondet()) foo(); else foo(); by just foo();. Technically,
the CHC encoding received by our tool does not represent all branches of the
original program, which thus leads to a smaller coverage detected. We have not
seen many such examples in our benchmark set, however.

7 Note the difference with the TestComp format [4] that keeps all values in the
same XML file. Our proposed format is more general and easily convertible to
TestComp.

8 The full logs and tables are available at
https://www.cs.fsu.edu/~grigory/horntinuum.zip.

9 All the binaries were downloaded from
https://test-comp.sosy-lab.org/2021/systems.php.


Fig. 3: Coverage comparison: each point in a plot represents a pair of the
coverages (% × %) of HORNTINUUM (x-axis) and a competitor (y-axis) for the same
benchmarks.


Expectations and results. We aim to answer two main questions: 


Q1 Is it possible to develop a competitive test-case generator based purely on
   formal verification and SMT solving, i.e., not relying on dynamic analysis
   and fuzzing?

Q2 In the cases when a CHC-based test-case generator yields coverage similar to
   (or better than) that of a competitor, is it possible to achieve this result
   faster¹⁰?


Plots in Fig. 3 and Fig. 4 attempt to answer these questions, respectively.
We first give a pairwise comparison between the coverage percentages reported
by the tools (Fig. 3). If a tool was unable to analyze a program, the
corresponding point is placed on the boundary. The experiments revealed that
given the same timeout, HORNTINUUM generates test cases with coverage larger
than or equal to that of KLEE on 241 programs, FUSEBMC on 178 programs, and
VERIFUZZ on 177 programs. These numbers include cases when the competitor
crashed or did not return any coverage but exclude cases when HORNTINUUM did so.


10 We believe the ability to successfully terminate the test-case generation
early is of great interest to software engineers. However, unfortunately, it is
not the main determining factor in testing competitions.


Fig. 4: Runtime needed to get 1% of coverage (sec × sec) of HORNTINUUM (x-axis)
and a competitor (y-axis). Solid triangles represent runs (green: HORNTINUUM,
orange: the competitor) in which the corresponding tool detected larger
coverage and took less time. Blank triangles are the remaining
(non-representative) runs. Triangles on the boundaries represent runs in which
one of the tools detected zero coverage.

A pairwise comparison between the "runtime/coverage" ratios of the tools is
shown in Fig. 4. For this experiment, for every plot, we considered only the
benchmarks on which one of the two tools generated test cases with larger
coverage and also terminated before the other tool. Specifically:


— 177 (resp. 44) on which VERIFUZZ (resp. HORNTINUUM) was outperformed. 
— 128 (resp. 44) on which FUSEBMC (resp. HORNTINUUM) was outperformed. 
— 124 (resp. 26) on which KLEE (resp. HORNTINUUM) was outperformed. 


These numbers let us conclude that HORNTINUUM is much more likely to return a
larger coverage in a shorter amount of time than a competitor is. The remaining
benchmarks (e.g., those on which HORNTINUUM generates more test cases but takes
more time than a competitor) are still shown in the plot but are excluded from
the statistics: in these cases it is impossible to draw a consistent conclusion
on the tools' performance.


Fig. 5: Impact of invariants: pairs of the runtimes (sec × sec) of HORNTINUUM
with and without invariants.
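The selection criterion behind these statistics (one tool has strictly larger
coverage and strictly smaller runtime) can be stated as a small counting
routine. The Python sketch below uses made-up numbers purely for illustration;
they are not measurements from the experiments, and the function name classify
is hypothetical.

```python
def classify(runs):
    """Count runs won by our tool, won by the competitor, and inconclusive ones.

    Each run is (our_coverage, our_seconds, their_coverage, their_seconds).
    A tool wins a run only with larger coverage and less time.
    """
    ours = theirs = inconclusive = 0
    for cov_h, time_h, cov_c, time_c in runs:
        if cov_h > cov_c and time_h < time_c:
            ours += 1
        elif cov_c > cov_h and time_c < time_h:
            theirs += 1
        else:
            inconclusive += 1      # e.g., larger coverage but also more time
    return ours, theirs, inconclusive


# Hypothetical data: one clear win each way and one mixed outcome.
runs = [(80, 10, 60, 50), (40, 100, 70, 20), (90, 300, 85, 30)]
print(classify(runs))
```

Runs in the third group correspond to the blank triangles of Fig. 4, which are
plotted but not counted.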


Controlled experiment. Lastly, we present an interesting statistic on the
effect of invariant generation on the runtime of test-case generation (Fig. 5).
For the sake of the experiment, we modified Alg. 1 such that it skips invariant
generation but enumerates traces and exploits the unsatisfiable prefixes. It
turns out that this
negatively affects 184 benchmarks, on which the modified version takes more 
time. These include 12 benchmarks, on which HORNTINUUM with invariants 
terminates before the timeout, but HORNTINUUM without invariants does not 
terminate (represented as points on the right boundary). These benchmarks 
demonstrate a possible scenario when programs under test have unreachable 
branches that can be identified by a CHC solver, allowing the test-case generator 
to terminate earlier. 


8 Conclusion 


We have shown that CHCs are a promising vehicle that test-case generators
could use in order to improve the quality of solutions and the runtime.
Specifically, using CHC encodings of programs, various program unrollings are
enumerated, and test cases are extracted from models of satisfiable formulas.
Our novel CHC-based approach and its implementation in HORNTINUUM use SMT
solvers incrementally. In the future we are going to extend our support for data
types and optimize the algorithm for searching deep counterexamples à la [6].


Acknowledgments The work is supported in part by a gift from Amazon


Abstract. Many safety critical systems guarantee fault-tolerance by using
several redundant copies of their components. When designing such
redundancy architectures, it is crucial to analyze their fault trees, which
describe combinations of faults of individual components that may cause
malfunction of the system. State-of-the-art techniques for fault tree
computation use first-order formulas with uninterpreted functions to model
the transformations of signals performed by the redundancy system and
an AllSMT query for computation of the fault tree from this encoding.
Scalability of the analysis can be further improved by techniques such as
predicate abstraction, which reduces the problem to the Boolean case.

In this paper, we show that as far as fault trees of redundancy architectures
are concerned, signal transformation can be equivalently viewed
in a purely Boolean way as fault propagation. This alternative view has
important practical consequences. First, it applies also to general
redundancy architectures with cyclic dependencies among components, to
which the current state-of-the-art methods based on AllSMT are not
applicable, and which currently require expensive sequential reasoning.
Second, it allows for a simpler encoding of the problem and usage of
efficient algorithms for analysis of fault propagation, which can
significantly improve the runtime of the analyses. A thorough experimental
evaluation demonstrates the superiority of the proposed techniques.


1 Introduction 


Fault-tolerance is a fundamental property of safety critical systems that enables 
their safe operation even in the presence of faults. There are many ways to 
ensure fault-tolerance, often based on redundancy: spare parts are available for 
backup and are ready to take over with different degrees of promptness (e.g., 
hot/warm/cold standby), or with multiple replicas running in parallel. The latter
is a common approach to fault-tolerance in computer-based control systems, 
where the results computed by the independent replicas are combined together 
by means of voters. The idea dates back to the pioneering space application in 
Saturn Launch Vehicle [12], and has then been adopted in the Primary Flight 
Computer [19] of the Boeing 777. The idea is becoming prominent with the 
advent of modern Integrated Modular Avionics [16], a cost-effective solution for 
the management of highly intensive software control systems. 


© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 273-291, 2022.
https://doi.org/10.1007/978-3-030-99527-0_15

Fig. 1: Network of computational modules with cyclic dependencies, extended by
triple modular redundancy.

Fig. 2: Selected ways of extending a single reference module M with triple
modular redundancy (using 1, 2, and 3 voters) [6].


One of the most used instances of the approach to redundancy by using 
module replicas is the triple modular redundancy (TMR) schema, in which the 
computational modules are replaced by three redundant copies, whose results can 
be combined by one to three voters. An example of using TMR to add redundancy 
to a reference non-redundant architecture is shown in Figure 1. Note that there 
are multiple ways of combining the results of a single triplicated computational 
module by voters, some of which are shown in Figure 2 [6]. 

Assessing the actual degree of fault-tolerance of a redundant architecture 
is directly related to the construction and analysis of the corresponding fault 
tree [17]. A fault tree describes the combinations of failures of individual com- 
ponents that may cause higher-level malfunction, e.g., bring the system into a 
dangerous state. Such combinations are traditionally called cut sets. Given the 
set of all cut sets of the system, a fault tree can be reconstructed. Subsequently, 
from the fault tree expressed as a Binary Decision Diagram, it is possible to 
compute the reliability of the system from the reliability measures of the com- 
ponents, and to synthesize the analytical form of the reliability function [6]. 

In this paper, we tackle the problem of automatically analyzing the reliabil- 
ity of redundancy architectures with parallel replicas and voting. We propose a 
general framework that encompasses also redundancy architectures with cyclic 
dependencies among components, such as the system from Figure 1, to which 
current state-of-the-art approaches [6] are not applicable. The modeling is based
on symbolic transition systems over the quantifier-free theory of linear real
arithmetic and uninterpreted functions (UFLRA). In particular, real numbers are
used to represent the signals of the architecture and multiple instances of the
same uninterpreted function symbol are used to represent component replicas.
The modeling framework is a strict generalization of the combinational approach
proposed in [4,5], which only allows for acyclic architectures.

As the main contribution, we propose an analysis technique based on the 
reduction to fault propagation graphs over Boolean structures [7]. We prove that 
the reduction is correct: the signal transformation performed by a redundancy 
architecture can be equivalently viewed in a Boolean way as fault propagation. 

We carry out a systematic experimental evaluation on the set of redundancy 
architectures with cyclic dependencies to evaluate scalability of the proposed so- 
lution. Moreover, we perform evaluation on acyclic redundancy architectures to 
compare the performance against the state-of-the-art approach based on pred- 
icate abstraction [5,6], which can be applied only to redundancy architectures 
without cycles. The proposed approach proves to be very scalable, being able to 
analyze cyclic architectures with thousands of nodes, and is dramatically more 
efficient than a direct reduction to model checking of symbolic transition systems 
over UFLRA. In the restricted set of acyclic benchmarks, the proposed approach 
provides better performance even over the optimized method proposed in [5] and 
extended in [6] that adopts a structural form of predicate abstraction to improve 
over basic AllSMT [14].

The paper is structured as follows. In Section 2, we present logical
preliminaries and basic notions of fault propagation graphs. In Section 3, we describe the
framework of redundancy architectures with cycles. In Section 4, we present the 
reduction to fault propagation and prove its correctness. In Section 5, we discuss 
the related work. The experiments are presented in Section 6. In Section 7, we 
draw some conclusions and discuss some directions for future work. 


2 Preliminaries 


2.1 General Background 


In this section, we explain the basic mathematical conventions that are used in
the paper. We assume that the reader is familiar with standard first-order logic
and the basic ideas of Satisfiability Modulo Theories (SMT), as presented e.g.
in [1]. A theory in the SMT sense is a pair $(\Sigma, \mathcal{C})$, where $\Sigma$ is a
first-order signature and $\mathcal{C}$ is a class of models over $\Sigma$. We use the
standard notions of interpretation, assignment, model, satisfiability, validity,
and logical consequence. We refer to 0-arity predicates as Boolean variables, and
to 0-arity uninterpreted functions as (theory) variables. We denote variables with
$x, y, \ldots$, formulas with $\varphi, \psi, \ldots$, and uninterpreted functions with
$f, g, \ldots$, possibly with subscripts. We denote vectors with an overline
(e.g. $\overline{x}$), and individual components with subscripts (e.g. $x_j$). We
denote the domain of Booleans with $\mathbb{B} = \{\top, \bot\}$. If $x_1, \ldots, x_n$ are
variables and $\varphi$ is a formula, we write $\varphi(x_1, \ldots, x_n)$ to indicate that
all the variables occurring free in $\varphi$ are in $x_1, \ldots, x_n$. If $\varphi$ is a
formula without uninterpreted functions and $\mu$ is a function that maps each free
variable of $\varphi$ to a value of the corresponding sort, $[\![\varphi]\!]_\mu$ denotes the
result of the evaluation of $\varphi$ under this assignment. A Boolean formula is
called positive if it does not use other logical connectives than conjunctions
and disjunctions.

In this paper, we shall use the theory of linear real arithmetic (LRA), in 
which the numeric constants and the arithmetic and relational operators have 
their standard meaning, extended with uninterpreted functions (UF), whose in- 
terpretation is not fixed in C, and with voters (V), which are k-ary functions 
whose interpretation is the majority function defined as below. For simplicity, 
we consider only voters with odd arity as even-arity voters are rarely used in 
practice. However, our approach can be extended to support even-arity voters. 


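Definition 1. The $k$-ary majority function $\mathit{majority}\colon \mathbb{R}^k \to \mathbb{R}$
for an odd $k > 0$ is defined by $\mathit{majority}(\overline{x}) = y$ if there is $y$ such
that $y = x_j$ for at least $\lceil k/2 \rceil$ distinct $j$, and
$\mathit{majority}(\overline{x}) = x_1$ otherwise.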
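
As an illustration of Definition 1, the majority function can be sketched in a
few lines of Python. This is only a sketch under the stated definition; the
function name and the use of `collections.Counter` are choices made here for
illustration and are not taken from the paper's implementation.

```python
from collections import Counter
from typing import Sequence


def majority(xs: Sequence[float]) -> float:
    """Majority vote of Definition 1 for an odd number of inputs."""
    k = len(xs)
    if k == 0 or k % 2 == 0:
        raise ValueError("majority is defined only for odd k > 0")
    # The most frequent value is the only candidate for a strict majority.
    value, count = Counter(xs).most_common(1)[0]
    # ceil(k/2) equals (k + 1) // 2 for odd k; fall back to x_1 otherwise.
    return value if count >= (k + 1) // 2 else xs[0]
```

For instance, `majority([0.0, 5.0, 5.0])` yields the agreed value `5.0`, while
three pairwise different inputs yield the first input.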


Given a set of variables $\overline{x}$, we denote with $\overline{x}'$ the set
$\{x' \mid x \in \overline{x}\}$. A symbolic transition system $S$ is a triple
$(\overline{x}, I(\overline{x}), T(\overline{x}, \overline{x}'))$, where $\overline{x}$ is a set of
variables, and $I(\overline{x})$, $T(\overline{x}, \overline{x}')$ are formulae over some
signature. An assignment to the variables in $\overline{x}$ is a state of $S$. A state
$s$ is initial iff it is a model of $I(\overline{x})$, i.e., $s \models I(\overline{x})$.
The states $s, s'$ denote a transition iff $s \cup s' \models T(\overline{x}, \overline{x}')$,
also written $T(s, s')$. A trace is a sequence of states $s_0, s_1, \ldots$ such that
$s_0$ is initial and $T(s_i, s_{i+1})$ for all $i$. We denote traces with $\pi$, and
with $\pi_j$ the $j$-th element of $\pi$. A state $s$ is reachable in $S$ iff there
exists a trace $\pi$ such that $\pi_i = s$ for some $i$.


2.2 Fault Propagation Graphs 


In this section we briefly introduce the necessary notions of fault propagation, 
and in particular the formalism of symbolic fault propagation graphs. Intuitively, 
fault propagation graphs can be used to describe how failures of some compo- 
nents of a given system can cause the failure of other components of a system. 
In an explicit (hyper)graph representation, components can be represented by
nodes, and dependencies by edges among them, with the meaning that an edge
from component $c_1$ to component $c_2$ states that the failure of $c_1$ can cause
the failure (propagation) of $c_2$. In the symbolic representation adopted here,
we model components as Boolean variables (where $\bot$ means “not failed” and $\top$
means “failed”), and express the dependencies as Boolean formulae encoding the
conditions that can lead to the failure of each component. The basic concepts 
are formalized in the following definitions. For more information, we refer to [7]. 


Definition 2 (Fault propagation graph). A symbolic fault propagation graph 
(FPG) is a pair (C, canFail), where C is a finite set of system components 
and canFail is a function that assigns to each component c a Boolean formula 
canFail(c) over the set of variables C. 

Definition 3 (Trace of FPG). Let $G$ be a fault propagation graph $(C, \mathit{canFail})$.
A state of $G$ is a function from $C$ to $\mathbb{B}$. A trace of $G$ is a sequence of states
$\pi = \pi_0 \pi_1 \ldots \in (\mathbb{B}^C)^\omega$ such that all $i > 0$ and $c \in C$ satisfy
(i) $\pi_i(c) = \pi_{i-1}(c)$ or (ii) $\pi_{i-1}(c) = \bot$ and
$\pi_i(c) = [\![\mathit{canFail}(c)]\!]_{\pi_{i-1}}$.


Example 1 ([7]). Consider a system with components control on ground (G),
hydraulic control (H), and electric control (E) such that G can fail if both H
and E have failed, H can fail if E has failed, and E can fail if H has failed. This
system can be modeled by a fault propagation graph $(\{G, E, H\}, \mathit{canFail})$, where
$\mathit{canFail}(G) = H \wedge E$, $\mathit{canFail}(H) = E$, and $\mathit{canFail}(E) = H$.

One of the traces of this system is
$\{G \mapsto \bot, H \mapsto \top, E \mapsto \bot\}\{G \mapsto \bot, H \mapsto \top, E \mapsto \top\}\{G \mapsto \top, H \mapsto \top, E \mapsto \top\}^\omega$,
where H is failed initially, which causes failure of E in the second step, and
the failures of H and E together cause a failure of G in the third step.
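
The trace of Example 1 can be replayed with a small Python sketch. It encodes
the three canFail formulas as predicates and lets every component fail as soon
as its formula holds in the previous state, which is one particular trace
permitted by Definition 3. The names `CAN_FAIL` and `eager_trace` are
hypothetical and introduced only for this illustration.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]  # True encodes "failed", False encodes "not failed"

# canFail formulas of Example 1, written as Python predicates over a state.
CAN_FAIL: Dict[str, Callable[[State], bool]] = {
    "G": lambda s: s["H"] and s["E"],
    "H": lambda s: s["E"],
    "E": lambda s: s["H"],
}


def eager_trace(initial: State) -> List[State]:
    """Return the distinct states up to the fixed point when every component
    fails as soon as its canFail formula holds in the previous state."""
    trace = [dict(initial)]
    while True:
        prev = trace[-1]
        # A failed component stays failed; others follow canFail.
        nxt = {c: prev[c] or CAN_FAIL[c](prev) for c in prev}
        if nxt == prev:
            return trace
        trace.append(nxt)
```

Starting from the state in which only H is failed, the function returns the
three states listed in the example; the last one repeats forever.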


Fault propagation graphs are often used to identify sets of initial faults that 
can lead the system to a dangerous or unwanted state (usually called a top level 
event). Such sets of initial faults are called cut sets. 


Definition 4 (Cut set). Let $G$ be a fault propagation graph $G = (C, \mathit{canFail})$
and $\varphi$ a positive Boolean formula, called top level event. The assignment
$cs\colon C \to \mathbb{B}$ is called a cut set of $G$ for $\varphi$ if there is a trace $\pi$ of
$G$ that starts in the state $cs$ and there is some $k \geq 0$ such that
$\pi_k \models \varphi$. A cut set $cs$ is called a minimal cut set if it is minimal with
respect to the pointwise ordering of functions $\mathbb{B}^C$, i.e., there is no other
cut set $cs'$ such that
$\{c \in C \mid cs'(c) = \top\} \subsetneq \{c \in C \mid cs(c) = \top\}$.


For brevity, when talking about cut sets, we often mention only the components
that are set to $\top$ by the cut set.


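Example 2 ([7]). The minimal cut sets of the FPG from Example 1 for the top
level event $\varphi = G$ are $\{G\}$, $\{H\}$, and $\{E\}$. These three cut sets are
witnessed by the following traces:


1. $\{G \mapsto \top, H \mapsto \bot, E \mapsto \bot\}^\omega$,
2. $\{G \mapsto \bot, H \mapsto \top, E \mapsto \bot\}\{G \mapsto \bot, H \mapsto \top, E \mapsto \top\}\{G \mapsto \top, H \mapsto \top, E \mapsto \top\}^\omega$,
3. $\{G \mapsto \bot, H \mapsto \bot, E \mapsto \top\}\{G \mapsto \bot, H \mapsto \top, E \mapsto \top\}\{G \mapsto \top, H \mapsto \top, E \mapsto \top\}^\omega$.


Note that the FPG has also other cut sets, such as $\{G, E\}$, $\{H, E\}$, and
$\{G, H, E\}$, which are not minimal.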
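
For a graph as small as the one in Example 1, the minimal cut sets can be
checked by brute force. The following Python sketch is not the SMT-based
enumeration algorithm of [7]; it simply tries all fault subsets in order of
increasing size and relies on the assumption, valid for positive canFail
formulas and a positive top level event, that it suffices to inspect the fixed
point of the trace in which components fail as early as possible. All names
are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List

State = Dict[str, bool]  # True encodes "failed"

# canFail formulas of Example 1 as predicates over a state.
CAN_FAIL: Dict[str, Callable[[State], bool]] = {
    "G": lambda s: s["H"] and s["E"],
    "H": lambda s: s["E"],
    "E": lambda s: s["H"],
}


def fixed_point(faults: FrozenSet[str]) -> State:
    """Propagate failures from the initial fault set until nothing changes."""
    state = {c: c in faults for c in CAN_FAIL}
    while True:
        nxt = {c: state[c] or CAN_FAIL[c](state) for c in state}
        if nxt == state:
            return state
        state = nxt


def minimal_cut_sets(top: Callable[[State], bool]) -> List[FrozenSet[str]]:
    """Enumerate minimal cut sets by exhaustive search (exponential cost)."""
    found: List[FrozenSet[str]] = []
    comps = sorted(CAN_FAIL)
    for size in range(len(comps) + 1):
        for subset in combinations(comps, size):
            cs = frozenset(subset)
            if any(m <= cs for m in found):
                continue  # cs contains a smaller cut set, so it is not minimal
            if top(fixed_point(cs)):
                found.append(cs)
    return found
```

Calling `minimal_cut_sets(lambda s: s["G"])` reproduces the three singleton
cut sets of Example 2. The search is exponential in the number of components
and is meant only to make the definitions concrete.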


In the following, we work with fault propagation graphs all of whose canFail
formulas are positive. Such fault propagation graphs are called monotone. Note
that the definition of trace ensures that in each trace, if a component $c$ is set to
$\top$ in a state $\pi_i$, it is $\top$ in all the subsequent states $\pi_j$ for $j > i$. This
ensures that each trace eventually reaches a fixed point. Moreover, before reaching
this fixed point, the trace can contain at most $|C|$ distinct states.

For monotone FPGs, there is an efficient algorithm for minimal cut set
enumeration [7]. This approach consists in enumerating the minimal models of a
specific LRA formula, in which theory constraints are used only if the input FPG
contains cycles (and which therefore is purely Boolean for acyclic FPGs).


3 Cyclic Redundancy Architectures


In this section, we describe the framework adopted to model redundancy
architectures, in the form of a restricted class of symbolic transition systems modulo
UFLRA. We call this restricted class transition systems with uninterpreted func- 
tions and voters (UF+V TS). This modeling framework is more expressive than 
mere SMT formulas modulo UFLRA, which were used in the previous works on 
analysis of redundancy architectures [6], as it can express architectures that 
contain cyclic dependencies among the modules. 


Definition 5 (UF+V transition system). A transition system with uninterpreted
functions and voters is a tuple $(V_s, V_{in}, V_{init}, T_{next}, T_{init})$, where


– $V_s$ is a finite set of real-valued signal variables;

– $V_{in}$ with $V_s \cap V_{in} = \emptyset$ is a finite set of real-valued input variables;

– $V_{init}$ is a finite set of real-valued initial value variables;

– $T_{next}\colon V_s \to \mathit{Expr}$ is a transition function, where $\mathit{Expr}$ is the set
of all expressions of form $f(x_1, x_2, \ldots, x_k)$ for $k > 0$, $x_i \in (V_s \cup V_{in})$,
and where $f$ is either an uninterpreted function symbol of arity $k$ or the
function symbol $\mathit{voter}_k$ with an odd $k > 0$;

– $T_{init}$ is an initial value mapping that assigns an initial value variable
$T_{init}(v) \in V_{init}$ to each signal $v \in V_s$ for which $T_{next}(v) = f(\overline{x})$ for an
uninterpreted $f$.


A UF+V transition system is called well formed if it does not contain cyclic
dependencies among voters, i.e., there is no sequence $v_1 \ldots v_n$ of signal
variables such that $v_1 = v_n$ and each $v_i$ with $i > 1$ satisfies
$T_{next}(v_i) = \mathit{voter}_k(x_1, \ldots, x_k)$ with $x_j = v_{i-1}$ for some
$1 \leq j \leq k$. For well formed UF+V TS, we can define voter depth
$vd\colon V_s \cup V_{in} \to \mathbb{N}$ as the unique solution to the following set of
equations: $vd(in) = 0$ for each $in \in V_{in}$, $vd(v) = 0$ for each $v \in V_s$ such that
$T_{next}(v) = f(x_1, x_2, \ldots, x_k)$, and
$vd(v) = \max\{vd(x_i) \mid 1 \leq i \leq k\} + 1$ for each $v \in V_s$ such that
$T_{next}(v) = \mathit{voter}_k(x_1, x_2, \ldots, x_k)$.

In the rest of the paper, we assume that all UF+V TS are well formed. In the
rest of this section, let us fix an arbitrary well formed UF+V transition system
$S = (V_s, V_{in}, V_{init}, T_{next}, T_{init})$.
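
The equations for voter depth translate directly into a recursive computation
that also detects ill-formed systems. The Python sketch below assumes a
hypothetical encoding of $T_{next}$ as a dictionary from signal names to a pair
of a kind (`"uf"` or `"voter"`) and an argument list; neither the encoding nor
the function name comes from the paper.

```python
from typing import Dict, List, Set, Tuple

# T_next as a map: signal -> (kind, arguments); kind is "uf" or "voter".
TNext = Dict[str, Tuple[str, List[str]]]


def voter_depth(t_next: TNext, inputs: List[str]) -> Dict[str, int]:
    """Compute vd for all inputs and signals; reject cyclic voter chains."""
    depth: Dict[str, int] = {name: 0 for name in inputs}
    on_stack: Set[str] = set()

    def visit(v: str) -> int:
        if v in depth:
            return depth[v]
        kind, args = t_next[v]
        if kind == "uf":
            # Uninterpreted modules add a time step, so their depth is 0
            # and cycles through them are allowed.
            depth[v] = 0
            return 0
        if v in on_stack:
            raise ValueError(f"cyclic dependency among voters at {v}")
        on_stack.add(v)
        depth[v] = max(visit(a) for a in args) + 1
        on_stack.discard(v)
        return depth[v]

    for v in t_next:
        visit(v)
    return depth
```

A voter fed only by module outputs or inputs gets depth 1, a voter fed by such
a voter gets depth 2, and a chain of voters that feeds back into itself raises
an error, matching the well-formedness condition.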

We now give a formal definition of the behavior of the UF+V system in pres- 
ence of faults. Intuitively, we are given the set Faults of faulty signal-producing 
components of the system, which do not have to behave correctly: a faulty com- 
ponent neither has to start in its specified initial value nor respect its transition 
function. 


Definition 6 (Trace of UF+V TS). A state of a UF+V transition system $S$
is an arbitrary assignment of real numbers to signal and input variables
$s\colon (V_s \cup V_{in}) \to \mathbb{R}$.
The sequence of states $\pi = \pi_0 \pi_1 \ldots \in (\mathbb{R}^{V_s \cup V_{in}})^\omega$ is called
a trace of the system $S$ for the fault set $\mathit{Faults} \subseteq V_s$, input stream
$\iota = \iota_0 \iota_1 \ldots \in (\mathbb{R}^{V_{in}})^\omega$, initial value assignment
$\mathit{Init}\colon V_{init} \to \mathbb{R}$, and interpretation $[\![\cdot]\!]$, which to each
uninterpreted function symbol $f$ of arity $k$ assigns a function
$[\![f]\!]\colon \mathbb{R}^k \to \mathbb{R}$, if:


– $\pi_i(in) = \iota_i(in)$ for all $i \geq 0$ and $in \in V_{in}$.

– For $v \in V_s \setminus \mathit{Faults}$ such that $T_{next}(v) = f(x_1, \ldots, x_k)$ with an
uninterpreted function symbol $f$, it is the case that
$\pi_0(v) = \mathit{Init}(T_{init}(v))$ and all $i > 0$ satisfy
$\pi_i(v) = [\![f]\!](\pi_{i-1}(x_1), \ldots, \pi_{i-1}(x_k))$.

– For all $i \geq 0$ and $v \in V_s \setminus \mathit{Faults}$ such that
$T_{next}(v) = \mathit{voter}_k(x_1, \ldots, x_k)$, it is the case that
$\pi_i(v) = \mathit{majority}(\pi_i(x_1), \ldots, \pi_i(x_k))$.


Traces for the fault set $\mathit{Faults} = \emptyset$ are called nominal.

1 Note that although UF+V TS and the related concepts can be defined directly
in terms of UFLRA symbolic transition systems, we chose to make the definition
explicit to simplify the presentation and proofs.


Note that each uninterpreted module needs one time step to compute its 
result, while the results of voters are instantaneous. The time delay for modules 
allows cyclic dependencies among modules, while no delay for voters gives the 
expected semantics to architectures where some replicas of a module are guarded 
by a voter and others are not, such as in schemas from Figures 2b and 2c. 


Example 3. Consider the example from Figure 1, where the reference system
with 3 modules $M_1$, $M_2$, and $M_3$ is extended with TMR such that the modules
$M_1$ and $M_2$ are replaced by three replicas whose results are combined by a voter.

We can represent the redundancy version of the system as a UF+V TS as
follows. The nominal behavior of the modules $M_1$, $M_2$, and $M_3$ is represented by
binary uninterpreted functions $f_1$, $f_2$, and $f_3$, respectively. Further, we represent
initial values of $M_1$, $M_2$, $M_3$ by variables $init_{M_1}$, $init_{M_2}$, and $init_{M_3}$,
respectively. Finally, we represent the output of the $i$-th replica of each module
$M_j$ by a signal variable $x_j^i$ and the output of the voter corresponding to the
module $M_j$ by a signal variable $x_j^v$.

This gives the UF+V transition system $S = (V_s, \{in_1, in_2\}, V_{init}, T_{next}, T_{init})$,
with $V_s = \{x_1^1, x_1^2, x_1^3, x_2^1, x_2^2, x_2^3, x_3, x_1^v, x_2^v\}$,
$V_{init} = \{init_{M_j} \mid j \in \{1, 2, 3\}\}$, and


$T_{next}(x_1^i) = f_1(in_1, x_3)$ for $1 \leq i \leq 3$,      $T_{init}(x_1^i) = init_{M_1}$ for $1 \leq i \leq 3$,
$T_{next}(x_2^i) = f_2(in_2, x_1^v)$ for $1 \leq i \leq 3$,    $T_{init}(x_2^i) = init_{M_2}$ for $1 \leq i \leq 3$,
$T_{next}(x_3) = f_3(x_1^v, x_2^v)$,                           $T_{init}(x_3) = init_{M_3}$,

$T_{next}(x_j^v) = \mathit{voter}_3(x_j^1, x_j^2, x_j^3)$ for $j \in \{1, 2\}$.


We define the class of redundancy transition systems, where the only pur- 
pose of all voters is to recognize and repair outputs of failed components; more 
specifically, if all components behave correctly, the voters are not necessary. 


Definition 7 (Redundancy UF+V TS). We call the system $S$ a redundancy
UF+V transition system if in all its nominal traces, all inputs of each voter are
always identical. Formally, if $\pi$ is any nominal trace of $S$ and if $v$ is a variable
for which $T_{next}(v) = \mathit{voter}_k(\overline{x})$, then
$|\{\pi_i(x_j) \mid 1 \leq j \leq k\}| = 1$ for all $i \geq 0$.

Similarly to FPGs, a cut set is a set of faults that leads to the undesired behavior
of the system. In particular, given a set of signals that are considered as output
signals (or outputs) of the system, a cut set of the given UF+V TS is a set of
faults that can cause an incorrect value of at least one output.


Definition 8 ((Minimal) cut set). A fault set $\mathit{Faults} \subseteq V_s$ is called a cut set
of $S$ for a set of output signals $V_{out} \subseteq V_s$ if there exist an input stream,
initial value assignment, and an interpretation such that values of output signals of
some trace $\pi$ for the fault set $\mathit{Faults}$ differ from the outputs of the nominal
trace $\pi^{nom}$ with the same input stream, initial values, and interpretation, i.e.,
there are $c \geq 0$ and $o \in V_{out}$ for which $\pi_c(o) \neq \pi^{nom}_c(o)$. A cut set is
called minimal (MCS) if it is minimal in terms of set inclusion.


Since the redundancy UF+V TS form a subclass of UFLRA transition systems, 
there is a straightforward procedure for minimal cut set enumeration. As in 
the case of combinational systems [6], one can construct a miter system, which 
consists of two copies of the architecture: the first is allowed to fail and the second 
is constrained to behave nominally. Minimal cut sets can then be obtained by 
using a technique based on symbolic model checking [3] to enumerate all minimal 
assignments to fault variables under which it is possible to reach some state in 
which the outputs of the two copies differ. 


4 Reducing Redundancy UF+V TS to Fault Propagation 
Graphs 


In this section, we show the main result of the paper, which is that minimal 
cut set enumeration of redundancy UF+V transition systems can be reduced to 
minimal cut set enumeration of Boolean fault propagation graphs, which is more 
efficient than MCS enumeration based on miter construction and model checking. 


4.1 Reduction 


For each UF+V system $S$, we define a corresponding FPG $S^B$. The components
of $S^B$ correspond to the signal variables of the original system $S$. With a slight
abuse of notation, we use the same names for the original real-valued signal
variables of $S$ and the components of $S^B$, although they have different types.
Intuitively, the reduction ensures that each component $v$ of $S^B$ can fail if and
only if there is a trace of $S$ in which the value of the signal variable $v$ deviates
from its nominal value.


Definition 9. Let $S = (V_s, V_{in}, V_{init}, T_{next}, T_{init})$ be a UF+V TS. We define
a corresponding FPG $S^B = (V_s, \mathit{canFail})$, where
$\mathit{canFail}(v) = \bigvee_{u \in \overline{x} \cap V_s} u$ if $T_{next}(v) = f(\overline{x})$ and
$\mathit{canFail}(v) = \mathit{atLeast}_{\lceil k/2 \rceil}(\overline{x} \cap V_s)$ if
$T_{next}(v) = \mathit{voter}_k(\overline{x})$, using the definition
$\mathit{atLeast}_m(X) = \bigvee_{Y \subseteq X,\, |Y| = m} \bigwedge_{y \in Y} y$.

2 Note that there are more efficient and compact encodings for the atLeast
constraint [18]; we use the simplest one for presentation purposes.

Example 4. Consider the transition system $S$ from Example 3. The corresponding
fault propagation graph is
$S^B = (\{x_1^1, x_1^2, x_1^3, x_2^1, x_2^2, x_2^3, x_3, x_1^v, x_2^v\}, \mathit{canFail})$, where


$\mathit{canFail}(x_1^i) = x_3$ for all $1 \leq i \leq 3$,      $\mathit{canFail}(x_2^i) = x_1^v$ for all $1 \leq i \leq 3$,
$\mathit{canFail}(x_3) = x_1^v \vee x_2^v$,
$\mathit{canFail}(x_1^v) = \mathit{atLeast}_2(x_1^1, x_1^2, x_1^3)$,    $\mathit{canFail}(x_2^v) = \mathit{atLeast}_2(x_2^1, x_2^2, x_2^3)$.
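
The reduction of Definition 9 is mechanical, which the following Python sketch
tries to convey. It assumes a hypothetical dictionary encoding of $T_{next}$
(signal name mapped to a kind and an argument list) and represents each
positive canFail formula in disjunctive normal form as a list of sets of
signals. The replica names `x1a`, `x1b`, `x1c` and so on stand for
$x_1^1, x_1^2, x_1^3$ and are chosen here only for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# T_next as a map: signal -> (kind, arguments), kind is "uf" or "voter".
TNext = Dict[str, Tuple[str, List[str]]]
# A positive formula in DNF: a disjunction of conjunctions of signals.
Dnf = List[FrozenSet[str]]


def to_fpg(t_next: TNext) -> Dict[str, Dnf]:
    """Build canFail for every signal, following Definition 9."""
    can_fail: Dict[str, Dnf] = {}
    for v, (kind, args) in t_next.items():
        # Keep only signal arguments; input variables never fail.
        signals = sorted({a for a in args if a in t_next})
        if kind == "uf":
            # Disjunction of all signal arguments.
            can_fail[v] = [frozenset([u]) for u in signals]
        else:
            # atLeast_m with m = ceil(k/2), in its simplest encoding.
            m = (len(args) + 1) // 2
            can_fail[v] = [frozenset(y) for y in combinations(signals, m)]
    return can_fail


# The system of Example 3 in the assumed encoding.
EXAMPLE_3: TNext = {
    **{f"x1{r}": ("uf", ["in1", "x3"]) for r in "abc"},
    **{f"x2{r}": ("uf", ["in2", "x1v"]) for r in "abc"},
    "x3": ("uf", ["x1v", "x2v"]),
    "x1v": ("voter", ["x1a", "x1b", "x1c"]),
    "x2v": ("voter", ["x2a", "x2b", "x2c"]),
}
```

Applying `to_fpg` to `EXAMPLE_3` gives the formulas of Example 4: each replica
of the first module can fail if `x3` has failed, `x3` can fail if either voter
output has failed, and each voter output can fail if two of its three inputs
have failed.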


4.2 Correctness 


We show that the reduction preserves the cut sets. In the rest of the section, let
$S = (V_s, V_{in}, V_{init}, T_{next}, T_{init})$ be an arbitrary redundancy UF+V TS,
$\mathit{Faults} \subseteq V_s$ be an arbitrary fault set, and $V_{out} \subseteq V_s$ be an arbitrary
set of output signals. First, we show that each cut set of $S$ corresponds to a cut
set of $S^B$.


Lemma 1. If $\mathit{Faults}$ is a cut set of $S$ for the set of outputs $V_{out}$, then $cs$
defined as $cs(v) = \top$ iff $v \in \mathit{Faults}$ is a cut set of $S^B$ for the top level
event $\bigvee_{o \in V_{out}} o$.


Proof. Let $\mathit{Faults}$ be a cut set of $S$ for some trace $\pi$ for some $\iota$,
$\mathit{Init}$, and $[\![\cdot]\!]$. Let $\pi^{nom}$ be the corresponding nominal trace.
Define the trace $\pi^B$ of $S^B$ as $\pi^B_0 = cs$ and for all $i > 0$ define $\pi^B_i$ by
$\pi^B_i(v) = \top$ if $\pi^B_{i-1}(v) = \top$ and
$\pi^B_i(v) = [\![\mathit{canFail}(v)]\!]_{\pi^B_{i-1}}$ if $\pi^B_{i-1}(v) = \bot$. In other
words, $\pi^B$ is the unique trace starting in $cs$ in which all the components fail as
soon as possible. By monotonicity, the trace $\pi^B$ has a fixed point, i.e., there is
$n$ such that $\pi^B_{n'} = \pi^B_n$ for all $n' \geq n$.

We show that $\pi^B_n$ satisfies $\pi^B_n(o) = \top$ for some $o \in V_{out}$, and thus $cs$
is a cut set for the top level event $\bigvee_{o \in V_{out}} o$. To do this, we prove by
induction on $i$ and on the voter depth $vd(v)$³ that for all $v \in V_s$ and $i \geq 0$,
$\pi_i(v) \neq \pi^{nom}_i(v)$ implies $\pi^B_n(v) = \top$. We distinguish three cases:

– If $v \in \mathit{Faults}$, then $\pi^B_0(v) = \top$. From the definition of $\pi^B$, this
implies that $\pi^B_l(v) = \top$ for all $l \geq 0$. In particular, $\pi^B_n(v) = \top$.

– If $v \notin \mathit{Faults}$ and $T_{next}(v) = f(x_1, \ldots, x_k)$, we distinguish two cases:

  • If $i = 0$: since $\pi_0(v) \neq \pi^{nom}_0(v)$, it must be the case that
  $\pi_0(v) \neq \mathit{Init}(T_{init}(v))$, therefore $v \in \mathit{Faults}$. This is a
  contradiction.

  • If $i \neq 0$: then $\pi_i(v) \neq \pi^{nom}_i(v)$ by definition implies

  $[\![f]\!](\pi_{i-1}(x_1), \ldots, \pi_{i-1}(x_k)) \neq [\![f]\!](\pi^{nom}_{i-1}(x_1), \ldots, \pi^{nom}_{i-1}(x_k))$

  and hence $\pi_{i-1}(x_j) \neq \pi^{nom}_{i-1}(x_j)$ for some $1 \leq j \leq k$ because
  $[\![f]\!]$ is a function. Since $\pi_{i-1}(in) = \pi^{nom}_{i-1}(in)$ holds for all
  $in \in V_{in}$, we know that $x_j \in V_s$. Therefore the induction hypothesis implies
  $\pi^B_n(x_j) = \top$ and thus $\pi^B_{n+1}(v) = \top$ because $\pi^B_n$ satisfies
  $\mathit{canFail}(v)$. Since $\pi^B_n$ was chosen as the fixed point of $\pi^B$, this
  implies $\pi^B_n(v) = \pi^B_{n+1}(v) = \top$.

– If $v \notin \mathit{Faults}$ and $T_{next}(v) = \mathit{voter}_k(x_1, \ldots, x_k)$, then
$\pi_i(v) \neq \pi^{nom}_i(v)$ for any $i \geq 0$ by definition implies

$\mathit{majority}(\pi_i(x_1), \ldots, \pi_i(x_k)) \neq \mathit{majority}(\pi^{nom}_i(x_1), \ldots, \pi^{nom}_i(x_k))$    (1)

Since $S$ is a redundancy TS, all $\pi^{nom}_i(x_j)$ are equal and the disequality (1)
implies that $\pi_i(x_j) \neq \pi^{nom}_i(x_j)$ for at least $\lceil k/2 \rceil$ of $x_j$. All
these $x_j$ are not in $V_{in}$ and must therefore be in $V_s$. By definition of voter
depth, $vd(x_j) < vd(v)$ for all these $x_j$. Therefore by the induction hypothesis
$\pi^B_n(x_j) = \top$ for at least $\lceil k/2 \rceil$ of $x_j$ and thus
$\pi^B_{n+1}(v) = \top$ because $\pi^B_n$ satisfies $\mathit{canFail}(v)$. This again implies
$\pi^B_n(v) = \pi^B_{n+1}(v) = \top$ because $\pi^B_n$ is the fixed point of $\pi^B$.


This finishes the proof: if $\mathit{Faults}$ is a cut set, $\pi_c(o) \neq \pi^{nom}_c(o)$ for
some $c \geq 0$ and $o \in V_{out}$, and thus $\pi^B_n(o) = \top$. Therefore we know that
$\pi^B_n \models \bigvee_{o \in V_{out}} o$ and thus $cs$ is a cut set of $S^B$.

3 Induction on the voter depth is employed because UF+V transition systems
propagate results of voters instantaneously.


For the converse direction, for each fault set we devise a trace of the UF+V
TS $S$ that propagates all the possible deviations from the nominal value. We call
this trace maximally fault-propagating. In this trace, all signal values are from
the set $\{0, 1\}$, and all nominal signal values are 0 and become 1 only as a result
of a fault. Moreover, if there is a trace for the given fault set in which a signal
deviates from its nominal value, the value of the corresponding signal in the
maximally fault-propagating trace will be 1.


Definition 10 (Maximally fault-propagating trace). Let $S$ be a UF+V
TS. Define


– $\iota_i(v_{in}) = 0$ for all $i \geq 0$, $v_{in} \in V_{in}$, i.e., $\iota$ is a stream of
constant zero inputs;

– $\mathit{Init}(v_{init}) = 0$ for each $v_{init} \in V_{init}$; and

– $[\![f]\!](x_1, \ldots, x_k) = 1 - \prod_{1 \leq i \leq k} (1 - x_i)$ for each uninterpreted
$f$, i.e., the output is 0 if all inputs are 0; it is 1 if at least one input is 1.


The maximally fault-propagating trace of $S$ for a fault set $\mathit{Faults}$, denoted as
$\pi^{fp}$, is the unique trace of $S$ for the above input stream, initial values,
interpretation, and the given fault set that for all $i \geq 0$ and $v$ satisfies
$\pi^{fp}_i(v) = 1$ whenever $v \in \mathit{Faults}$.
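
To make Definition 10 concrete, the following Python sketch simulates a finite
prefix of the maximally fault-propagating trace on the system of Example 3.
It is an illustration under assumptions, not the paper's implementation: the
dictionary encoding of $T_{next}$, the replica names `x1a` to `x2c`, and the
function names are all hypothetical. Module outputs use the interpretation
$1 - \prod (1 - x_i)$ on the previous state, voters are resolved within the
current state, and faulty signals are pinned to 1.

```python
from collections import Counter
from typing import Dict, List, Optional, Set, Tuple

TNext = Dict[str, Tuple[str, List[str]]]  # signal -> ("uf" | "voter", args)
State = Dict[str, float]


def majority(xs: List[float]) -> float:
    value, count = Counter(xs).most_common(1)[0]
    return value if count >= (len(xs) + 1) // 2 else xs[0]


def fp_trace(t_next: TNext, inputs: List[str], faults: Set[str],
             steps: int) -> List[State]:
    """Return the first steps + 1 states of the maximally
    fault-propagating trace for the given fault set."""

    def step(prev: Optional[State]) -> State:
        state: State = {name: 0.0 for name in inputs}  # constant zero inputs
        for v, (kind, args) in t_next.items():
            if v in faults:
                state[v] = 1.0  # faulty signals are always 1
            elif kind == "uf":
                if prev is None:
                    state[v] = 0.0  # every initial value is 0
                else:
                    prod = 1.0
                    for a in args:
                        prod *= 1.0 - prev[a]
                    state[v] = 1.0 - prod
        # Voters are instantaneous: resolve them once their inputs are known.
        pending = [v for v in t_next if v not in state]
        while pending:
            ready = [v for v in pending
                     if all(a in state for a in t_next[v][1])]
            if not ready:
                raise ValueError("cyclic dependency among voters")
            for v in ready:
                state[v] = majority([state[a] for a in t_next[v][1]])
                pending.remove(v)
        return state

    trace = [step(None)]
    for _ in range(steps):
        trace.append(step(trace[-1]))
    return trace


EXAMPLE_3: TNext = {
    **{f"x1{r}": ("uf", ["in1", "x3"]) for r in "abc"},
    **{f"x2{r}": ("uf", ["in2", "x1v"]) for r in "abc"},
    "x3": ("uf", ["x1v", "x2v"]),
    "x1v": ("voter", ["x1a", "x1b", "x1c"]),
    "x2v": ("voter", ["x2a", "x2b", "x2c"]),
}
```

With two faulty replicas of the first module, the voter output `x1v` is 1
already in the initial state and `x3` becomes 1 one step later, whereas a
single faulty replica is masked by the voter and `x3` stays at its nominal
value 0.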


Observe that the trace $\pi^{fp}$ is monotone, i.e., once a signal gets set to 1, it
stays set to 1 for the rest of the trace. This is formalized by the following lemma,
which can be proven by induction on $i$, $j - i$, and voter depth of $v$.


Lemma 2. Let $S$ be a UF+V TS, $\mathit{Faults}$ a fault set, and $\pi^{fp}$ the corresponding
maximally fault-propagating trace. Then $\pi^{fp}_i(v) = 1$ for each $i \geq 0$ and
$v \in V_s$ implies $\pi^{fp}_j(v) = 1$ for all $j > i$.


We can now show that if a trace of the FPG version $S^B$ of a UF+V TS $S$
triggers the top level event for some initial fault assignment, there is a trace in
the original system $S$ for the corresponding fault set whose output deviates from
the nominal one; namely it is the trace $\pi^{fp}$.

Lemma 3. If $cs$ defined as $cs(v) = \top$ iff $v \in \mathit{Faults}$ is a cut set of $S^B$
for the top level event $\bigvee_{o \in V_{out}} o$, then $\mathit{Faults}$ is a cut set of $S$
for the set of outputs $V_{out}$.


Proof. Suppose that the trace $\pi^B$ of $S^B$ with the initial state $cs$ satisfies
$\pi^B_c(o) = \top$ for some $c \geq 0$ and $o \in V_{out}$. We show that $\mathit{Faults}$ is a
cut set of $S$ for the set of output signals $V_{out}$. Let $\pi^{fp}$ be the maximally
fault-propagating trace of $S$ for $\mathit{Faults}$ and $\pi^{nom}$ the corresponding
nominal trace.

We show that for each $i \geq 0$ and $v \in V_s$, the condition $\pi^B_i(v) = \top$ implies
$\pi^{fp}_i(v) \neq \pi^{nom}_i(v)$. We proceed by induction on $i$:


– For $i = 0$: If $cs(v) = \pi^B_0(v) = \top$, then $v \in \mathit{Faults}$ and thus
$\pi^{fp}_0(v) \neq \pi^{nom}_0(v)$ because $\pi^{fp}_0(v) = 1$ and $\pi^{nom}_0(v) = 0$.

– For $i > 0$: Assume that $\pi^B_i(v) = \top$. We distinguish four cases:

  • If $v \in \mathit{Faults}$, then $\pi^{fp}_i(v) = 1$ and
  $\pi^{fp}_i(v) \neq \pi^{nom}_i(v) = 0$.

  • If $\pi^B_{i-1}(v) = \top$, then we get that
  $\pi^{fp}_{i-1}(v) \neq \pi^{nom}_{i-1}(v)$ from the induction hypothesis, and thus
  $\pi^{fp}_i(v) \neq \pi^{nom}_i(v)$ by Lemma 2.

  • If $v \notin \mathit{Faults}$, $\pi^B_{i-1}(v) = \bot$, and
  $T_{next}(v) = f(x_1, \ldots, x_k)$, then $\pi^B_i(v) = \top$ implies that
  $\pi^B_{i-1}(x_j) = \top$ for at least one $x_j \in V_s$. From the induction
  hypothesis, we get that $\pi^{fp}_{i-1}(x_j) \neq \pi^{nom}_{i-1}(x_j)$, and since
  $\pi^{nom}_{i-1}(x_j) = 0$, we know that $\pi^{fp}_{i-1}(x_j) = 1$. By the definition
  of $[\![f]\!]$ in $\pi^{fp}$, we know that also $\pi^{fp}_i(v) = 1$, which is not
  equal to $\pi^{nom}_i(v) = 0$.

  • If $v \notin \mathit{Faults}$, $\pi^B_{i-1}(v) = \bot$, and
  $T_{next}(v) = \mathit{voter}_k(x_1, \ldots, x_k)$, then $\pi^B_i(v) = \top$ implies that
  at least $\lceil k/2 \rceil$ of $x_j \in V_s$ satisfy $\pi^B_{i-1}(x_j) = \top$. From the
  induction hypothesis we get that $\pi^{fp}_{i-1}(x_j) \neq \pi^{nom}_{i-1}(x_j)$ for
  these $x_j$, and since $\pi^{nom}_{i-1}(x_j) = 0$, we know that
  $\pi^{fp}_{i-1}(x_j) = 1$ for at least $\lceil k/2 \rceil$ of $x_j$. By the definition of
  the majority function, we know that also $\pi^{fp}_{i-1}(v) = 1$ and thus, by
  Lemma 2, also $\pi^{fp}_i(v) = 1 \neq 0 = \pi^{nom}_i(v)$.


Therefore $\pi^B_c(o) = \top$ implies $\pi^{fp}_c(o) \neq \pi^{nom}_c(o)$ and $\mathit{Faults}$
is a cut set of $S$.


Theorem 1. For each fault set $\mathit{Faults}$, the following two claims are equivalent:


1. The set $\mathit{Faults}$ is a cut set of $S$ for the set of output signals $V_{out}$.
2. The assignment $cs$ defined as $cs(v) = \top$ iff $v \in \mathit{Faults}$ is a cut set of
$S^B$ for the top level event $\bigvee_{o \in V_{out}} o$.


5 Related Work 


Approaches to the analysis of redundant architectures include [6], which ad- 
dresses the generation of the reliability function for a class of generic architec- 
tures including tree- and DAG-like structures. The computation of the reliability 
is based on predicate abstraction and BDDs. Our work extends and improves 
the approach of [6] in several directions. First, it supports cyclic architectures, 
to which predicate abstraction as defined in [6] cannot be applied. Second, it 
does not require that the redundancy is localized within small blocks (manually 
defined by the user or in a library), to which the predicate abstraction can be
applied. In contrast, our approach applies the abstraction directly on the level of 
individual modules and voters. Moreover, the approach of [6] needs to compute 
the abstracted versions of the specified blocks upfront by quantifier elimination. 
Finally, our approach outperforms the approach of [6]. 

Other works on redundant architecture analysis are either based on ad-hoc 
algorithms [13], which are not fully automated and require discretization and 
additional input data from the user, or use simulation techniques such as Monte 
Carlo analysis [15], which do not examine the system behaviors exhaustively. 

A classification of fault tolerant architectures is presented in [10]. The clas- 
sification is based on three different patterns, namely comparison, voting, and 
sparing, that can be composed to define generic and possibly cyclic architectures. 
A follow-up work [11] builds upon these patterns and introduces strategies to 
evaluate several architectures at once (family-based analysis of redundant ar- 
chitectures) by reduction to Discrete Time Markov Chains. Our techniques are 
orthogonal, and could be applied on top of the approach proposed in [11]. 

The concept of maximally fault-propagating trace used to prove Lemma 3 is 
similar to the concept of maximally diverse interpretations [8], which can be used 
to efficiently reduce a formula in the positive fragment of EUF logic to a SAT 
formula. Both concepts restrict the interpretations of uninterpreted functions to 
a specific subclass, which exhibits all the relevant behaviors. 


6 Experimental Evaluation 


We have performed an experimental evaluation of the proposed approach for 
minimal cut set enumeration in order to answer the following research questions: 


RQ1 How does the new approach scale on redundancy architectures with cycles? 

RQ2 On redundancy architectures with cycles, how do the run-times compare 
against the approach based on the enumeration of minimal cut sets of the 
miter system by a model checker? 

RQ3 On redundancy architectures without cycles, how do the run-times com- 
pare against the approach based on predicate abstraction (PA) and BDD- 
based enumeration [6]? 

RQ4 On redundancy architectures without cycles, what part of the runtime 
difference is caused by the different reduction to a Boolean problem (FPG vs 
PA) and what part is caused by a different solving approach of the resulting 
Boolean problem (SAT-based vs BDD-based)? 


6.1 Benchmarks and Setup 


To answer these research questions, we used four sets of redundancy systems: 


Scalable cyclic systems This benchmark set contains two kinds of benchmarks. 
For evaluation on redundancy architectures with a linear number of cycles, 
we have generated ladder-shaped (Figure 3a) architectures of all lengths 
between 1 and 100. For evaluation on redundancy architectures with a large 
number of cycles, we have generated radiator-shaped (Figure 3b) architectures 
of all lengths between 1 and 50. For each of the architectures, we have 
generated its three redundant versions by replacing each module by a TMR 
block with one to three voters by using schemas from Figures 2b, 2d, and 2e. 
This yields systems with 2 · length · (3 + numVoters) signals. 

Fig. 3: Scalable architectures used in the experimental evaluation. 

Random cyclic systems We have generated 250 random cyclic redundancy 
UF+V systems with 1 to 150 modules of arity between 1 and 3, randomly 
generated 1 to 6 replicas of each module, and 1 to 6 voters of arity 3 or 5, 
randomly connected to the replicas. 

Scalable acyclic systems This benchmark set contains linear-shaped (Fig- 
ure 3c) and rectangular-shaped (Figure 3d) architectures of all lengths be- 
tween 1 and 200 that were used for evaluation of predicate abstraction tech- 
nique [6]. As in the original paper, we have used redundant versions of the 
systems with the modules replaced by a TMR block with one to three voters. 

Random acyclic systems We have used randomly generated acyclic architec- 
tures composed of randomly chosen TMR blocks that were also used in [6]. 


We have evaluated the following approaches for minimal cut set enumeration: 


— For the systems with cycles, we have generated their FPG version as described 
in Section 4 and also the UFLRA transition system implementing the miter 
construction in the SMV format. For enumeration of the minimal cut sets 
of the fault propagation graphs, we have used the tool SMT-PGFDS [7] 
(denoted as FPG in the experiments); for enumeration of the minimal cut 
sets of miter systems, we have used the tool xSAP [2], which internally uses 
an algorithm based on parametric IC3 [3] (denoted as ParamIC3). 

— For the systems without cycles, we have generated both their FPG version and 
the description in the format of the tool OCRA [9] as used in [6]. Although 
the FPGs could be solved by the tool SMT-PGFDS and the OCRA systems 
can be solved by predicate abstraction, which is implemented in xSAP, and 
its BDD-based engine [6], this would not compare only the effect of the 
reduction to the Boolean case, but also a confounding factor of the underlying 
backend (SAT-based in SMT-PGFDS and BDD-based in xSAP). To answer 
RQ4, we have thus performed more fine-grained analysis as follows. 

From each FPG, we generated the corresponding Boolean formula, which 
is possible since the graph is acyclic [7]. We also generated the Boolean 
formula obtained by predicate abstraction from each OCRA encoding. We 
thus obtained two Boolean formulas for each system: one by reduction to 
fault propagation (FP), and one by reduction by predicate abstraction (PA). 
We have then used the SAT-based enumeration algorithm of SMT-PGFDS 
and also BDD-based enumeration algorithm of xSAP on both of these Boolean 
formulas. This gives 4 combinations: FP-SAT, FP-BDD, PA-SAT, PA-BDD. 

Fig. 4: Solving time on ladder-shaped benchmarks. Divided according to the 
number of voters per one reference module. 

Fig. 5: Solving time on radiator-shaped benchmarks. Divided according to the 
number of voters per one reference module. (Plot: time in seconds against the 
size of the architecture, for the methods FPG and ParamIC3.) 


All experiments were executed on a cluster of 9 computational nodes, each 
with an Intel Xeon CPU X5650 @ 2.67GHz, 12 CPUs and 96 GiB of RAM. We have 
used a time limit of 1 hour of wall-clock time and a memory limit of 16 GB for 
each benchmark-solver pair. The detailed experimental results can be found at 
https://es-static.fbk.eu/people/mjonas/papers/tacas22_redarchs/. 


6.2 Results for Cyclic Benchmarks 


The comparison of running times of FPG-based and of model-checking-based 
approaches on the scalable cyclic benchmarks is shown in Figures 4 and 5. 
Figure 4 shows a significant benefit of the technique based on fault 
propagation on the ladder-shaped benchmarks: not only can it enumerate cut 
sets of all the used benchmarks, but its run-times are also dramatically 
better. However, as can be seen in Figure 5, the situation is different on the 
radiator-shaped benchmarks, which contain a large number of cycles. Although 
the performance of the technique based on fault propagation is still superior 
to the model-checking-based technique, it scales poorly on the systems with 2 
and 3 voters per one TMR block. The answer to RQ1 is thus that the proposed 
approach scales well if the number of cycles in the system is not too large; 
if the number of cycles is large, the technique scales worse, but nevertheless 
significantly better than the state-of-the-art technique based on miter 
construction and model checking [3]. 

Fig. 7: Solving time on scalable acyclic benchmarks. Divided by the architecture 
and number of voters per one reference module. 

The run-times on random cyclic benchmarks are shown in Figure 6. The figure 
shows that the performance of the proposed technique is better by several 
orders of magnitude and can enumerate minimal cut sets of 59 random systems 
that are out of reach for the technique based on model checking. Note that 
some of the systems are hard for both of the approaches: both approaches timed 
out on 66 of the 250 benchmarks. Together with the results for the 
ladder-shaped and radiator-shaped systems, this answers RQ2: the technique 
proposed in this paper has significantly better performance than the 
state-of-the-art technique based on model checking. 

There are two reasons for the observed performance difference. The first is 
the reduction of the UFLRA transition system to a Boolean one, which has also 
been observed to bring significant benefit on acyclic systems in the case of 
predicate abstraction [6]. The second is the underlying MCS-enumeration 
technique applied to the resulting FPG. This technique reduces the expensive 
sequential reasoning to an enumeration of minimal models of a single SMT 
formula, which can significantly improve performance [7]. 

Fig. 6: Solving time on random cyclic benchmarks. (Scatter plot; one axis is 
labelled "Time FPG (s)".) 


6.3 Results for Acyclic Benchmarks 


The comparison of the performance on acyclic scalable benchmarks is shown in 
Figure 7. The results are divided according to the method used to reduce the 
problem to the Boolean case (FP vs. PA) and the technique used to enumerate the 
minimal cut sets of the Boolean system (SAT vs. BDD). Scatter plots of solving 
times on random acyclic benchmarks can be seen in Figure 8. 

Fig. 8: Solving time on random acyclic benchmarks. 

The results show that the reduction of the problem to fault propagation and 
using an off-the-shelf solver for enumeration of minimal cut sets of the resulting 
Boolean system (i.e., FP-SAT) is clearly superior to the state-of-the-art approach 
based on predicate abstraction and BDD-based MCS enumeration (i.e., PA-BDD). 
The difference between these two approaches is even several orders of magnitude 
on scalable benchmarks and grows with the size of the system and its complexity. 
The performance is also significantly better on the random benchmarks. This 
answers RQ3 in favor of the technique proposed in this paper. 

As for RQ4, Figures 7 and 8 show that both the different reduction technique 
(FP vs. PA) and the solving technique (SAT vs. BDD) play a role in this 
difference. However, the larger part of the runtime difference between the 
proposed approach (FP-SAT) and the state-of-the-art approach (PA-BDD) [6] is 
due to the better performance of SAT-based enumeration. This insight is an 
additional interesting outcome of our experiments. Nevertheless, for both of 
the enumeration approaches, the proposed reduction based on fault propagation 
provides better performance than the state-of-the-art reduction by predicate 
abstraction. 


7 Conclusions and Future Work 


We have presented a framework for modeling redundancy architectures with 
possible cyclic dependencies among the computational modules and we have 
developed an efficient approach for enumeration of minimal cut sets of such 
architectures. The experimental evaluation has shown that this approach dra- 
matically outperforms the state-of-the-art approach based on model checking on 
cyclic redundancy architectures and has a better performance than the state-of- 
the-art approach based on predicate abstraction on acyclic architectures. 

In the future, we plan to extend the approach to a more general class of voters 
than majority voters. We also plan to extend the approach to support common 
cause analysis for different component faults and possibly to synthesize an opti- 
mal distribution of the modules of the architecture between the computational 
nodes of a system such as Integrated Modular Avionics. 
Abstract. We follow up on the idea of Lars Arge to rephrase the Reduce 
and Apply operations of Binary Decision Diagrams (BDDs) as iterative 
I/O-efficient algorithms. We identify multiple avenues to simplify and 
improve the performance of his proposed algorithms. Furthermore, we 
extend the technique to other common BDD operations, many of which 
are not derivable using Apply operations alone. We provide asymptotic 
improvements to the few procedures that can be derived using Apply. 
Our work has culminated in a BDD package named Adiar that is able 
to efficiently manipulate BDDs that outgrow main memory. This makes 
Adiar surpass the limits of conventional BDD packages that use recursive 
depth-first algorithms. It is able to do so while still achieving a 
satisfactory performance compared to other BDD packages: on instances larger 
than 9.5 GiB, Adiar, in parts using the disk, is only 1.47 to 3.69 times 
slower than CUDD and Sylvan, which exclusively use main memory. 
Yet, Adiar is able to obtain this performance at a fraction of the main 
memory needed by conventional BDD packages to function. 


Keywords: Time-forward Processing · External Memory Algorithms · 
Binary Decision Diagrams 


1 Introduction 


A Binary Decision Diagram (BDD) provides a canonical and concise representa- 
tion of a boolean function as an acyclic rooted graph. This turns manipulation 
of boolean functions into manipulation of graphs [10, 11]. 

Their ability to compress the representation of a boolean function has made 
them widely used within the field of verification. BDDs have especially found use 
in model checking, since they can efficiently represent both the set of states and 
the state-transition function [11]. Examples are the symbolic model checkers 
NuSMV [14,15], MCK [17], LTSmin [19], and MCMAS [24] and the recently 
envisioned symbolic model checking algorithms for CTL* in [3] and for CTLK 
in [18]. Hence, continuous research effort is devoted to improving the performance 
of this data structure. For example, despite the fact that BDDs were initially 
envisioned back in 1986, BDD manipulation was first parallelised in 2014 by 
Velev and Gao [35] for the GPU and in 2016 by Van Dijk and Van de Pol [16] 
for multi-core processors [12]. 


© The Author(s) 2022 
The most widely used implementations of decision diagrams make use of 
recursive depth-first algorithms and a unique node table [16, 23,34]. Lookup of 
nodes in this table and following pointers in the data structure during recursion 
both pause the entire computation while missing data is fetched [21,26]. For large 
enough instances, data has to reside on disk and the resulting I/O-operations 
that ensue become the bottleneck. So in practice, the limit of the computer's 
main memory becomes the upper limit on the size of the BDDs. 


Related Work. Prior work has been done to overcome the I/Os spent while 
computing on BDDs. David Long [25] achieved a performance increase of a fac- 
tor of two by blocking all nodes in the unique node table based on their time 
of creation, i.e. with a depth-first blocking. But, in [6] this was shown to only 
improve the worst-case behaviour by a constant. Ochi, Yasuoka, and Yajima [28] 
designed in 1993 breadth-first BDD algorithms that exploit a levelwise locality 
on disk. Their technique was improved by Ashar and Cheong [8] in 1994 and 
by Sanghavi et al. [31] in 1996. The fruit of their labour was the BDD library 
CAL capable of manipulating BDDs larger than available main memory. Kun- 
kle, Slavici and Cooperman [22] extended in 2010 the breadth-first approach to 
distributed BDD manipulation. 

The breadth-first algorithms in [8,28,31] are not optimal in the I/O-model, 
since they still use a single hash table for each level. This works well in practice, 
as long as a single level of the BDD can fit into main memory. If not, they still 
exhibit the same worst-case I/O behaviour as other algorithms [6]. 

In 1995, Arge [5, 6] proposed optimal I/O algorithms for the basic BDD 
operations Apply and Reduce. To this end, he dropped all use of hash tables. 
Instead, he exploited a total and topological ordering of all nodes within the 
graph. This is used to store all recursion requests in priority queues, so they 
get synchronized with the iteration through the sorted input stream of nodes. 
Martin Šmérek implemented these algorithms in 2009 as they were described, 
but the performance was disappointing, since the intermediate unreduced BDD 
grew too large to handle in practice [personal communication, Sep 2021]. 


Contributions. Our work directly follows up on the theoretical contributions 
of Arge in [5,6]. We simplified and improved on his I/O-optimal Apply and 
Reduce algorithms. In particular, we modified and pruned the intermediate rep- 
resentation, to prevent data duplication and to save on the number of sorting 
operations. We also provide I/O-efficient versions of several other standard BDD 
operations, where we obtain asymptotic improvements for the operations that 
are derivable from Apply. 

Our proposed algorithms and data structures have been implemented to cre- 
ate a new easy-to-use and open-source BDD package, named Adiar. Our experi- 
mental evaluation shows that Adiar is able to manipulate BDDs larger than the 
given main memory available, with only an acceptable slowdown compared to a 
conventional BDD library running exclusively in main memory. 
1.1 Overview 


The rest of the paper is organised as follows. Section 2 covers preliminaries on 
the I/O-model and Binary Decision Diagrams. We present our algorithms for 
I/O-efficient BDD manipulation in Section 3. Section 4 provides an overview 
of the resulting BDD package, Adiar, and Section 5 contains an experimental 
evaluation of it. Our conclusions and future work are in Section 6. 


2 Preliminaries 


2.1 The I/O-Model 


The I/O-model [1] allows one to reason about the number of data transfers be- 
tween two levels of the memory hierarchy, while abstracting away from technical 
details of the hardware, to make a theoretical analysis manageable. 

An I/O-algorithm takes inputs of size N, residing on the higher level of 
the two, i.e. in external storage (e.g. on a disk). The algorithm can only do 
computations on data that reside on the lower level, i.e. in internal storage 
(e.g. main memory). This internal storage can only hold a smaller and finite 
number of M elements. Data is transferred between these two levels in blocks 
of B consecutive elements [1]. Here, B is a constant size not only encapsulating 
the page size or the size of a cache-line but more generally how expensive it is 
to transfer information between the two levels. The cost of an algorithm is the 
number of data transfers, i.e. the number of I/O-operations, or just I/Os, it uses. 

For all realistic values of $N$, $M$, and $B$ we have that 
$N/B < \mathrm{sort}(N) \ll N$, where 
$\mathrm{sort}(N) = N/B \cdot \log_{M/B}(N/B)$ [1,7] is the sorting lower 
bound, i.e. it takes $\Omega(\mathrm{sort}(N))$ I/Os in the worst case to sort 
a list of $N$ elements [1]. With an $M/B$-way merge sort algorithm, one can 
obtain an optimal $O(\mathrm{sort}(N))$ I/O sorting algorithm [1], and with the 
addition of buffers to lazily update a tree structure, one can obtain an 
I/O-efficient priority queue capable of inserting and extracting $N$ elements 
in $O(\mathrm{sort}(N))$ I/Os [4]. 
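
To get a feel for the gap between these bounds, the following minimal Python sketch evaluates N/B, sort(N) and N for one hypothetical machine configuration. The concrete values of N, M and B are assumptions chosen purely for illustration, and the helper names scan_ios and sort_ios are not taken from the paper.

```python
import math


def scan_ios(n: int, b: int) -> float:
    """I/Os needed to read n elements once in blocks of b elements: N/B."""
    return n / b


def sort_ios(n: int, m: int, b: int) -> float:
    """Sorting bound sort(N) = N/B * log_{M/B}(N/B)."""
    return (n / b) * math.log(n / b, m / b)


# Hypothetical sizes, counted in elements (assumptions, not measurements).
N = 2**30   # input size
M = 2**24   # number of elements that fit in internal memory
B = 2**10   # number of elements per block

print(f"N/B     = {scan_ios(N, B):,.0f}")
print(f"sort(N) = {sort_ios(N, M, B):,.0f}")
print(f"N       = {N:,}")
```

Under these assumed sizes, sort(N) is only a small constant factor above a single scan and roughly three orders of magnitude below N, which is why an O(sort(N)) algorithm is preferable to one that may spend an I/O per element.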


TPIE. The TPIE library [36] provides an implementation of I/O-efficient al- 
gorithms and data structures such that the use of B-sized buffers is completely 
transparent to the programmer. Elements can be stored in files that act like 
lists. One can push new elements to the end of a file and read the next 
elements from the file in either direction, provided has_next returns true. One can 
also peek the next element without moving the read head. TPIE provides an 
optimal O(sort(N)) external memory merge sort algorithm for its files. Further- 
more, it provides an implementation of the I/O-efficient priority queue of [30] as 
developed in [29], which supports the push, top and pop operations. 


2.2 Binary Decision Diagrams 


A Binary Decision Diagram (BDD) [10], as depicted in Fig. 1, is a rooted directed 
acyclic graph (DAG) that concisely represents a boolean function 
$\mathbb{B}^n \to \mathbb{B}$, where $\mathbb{B} = \{\top, \bot\}$. The leaves 
contain the boolean values $\bot$ and $\top$ that define the output of the 
function. Each internal node contains the label $i$ of the input variable $x_i$ 
it represents, together with two outgoing arcs: a low arc for when 
$x_i = \bot$ and a high arc for when $x_i = \top$. We only consider Ordered 
Binary Decision Diagrams (OBDD), where each unique label may only occur once 
and the labels must occur in sorted order on all paths. The set of all nodes 
with label $j$ is said to belong to the $j$th level in the DAG. 

Fig. 1: Examples of Reduced Ordered Binary Decision Diagrams. Leaves are 
drawn as boxes with the boolean value and internal nodes as circles with the 
decision variable. Low edges are drawn dashed while high edges are solid. 

If one exhaustively (1) skips all nodes with identical children and (2) removes 
any duplicate nodes, then one obtains the Reduced Ordered Binary Decision Di- 
agram (ROBDD) of the given OBDD. If the variable order is fixed, this reduced 
OBDD is a unique canonical form of the function it represents [10]. 
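
The two reduction rules can be illustrated with a small in-memory Python sketch. It is not the I/O-efficient Reduce developed in this paper: the dictionary-based representation, the function name reduce_obdd and the example diagram are assumptions made only for illustration. Nodes are identified by a pair (label, id) and leaves are Python booleans.

```python
def reduce_obdd(nodes, root):
    """Apply both reduction rules bottom-up.

    nodes maps a uid (label, id) to its (low, high) children, where a child
    is either another uid or a bool standing for a leaf. Returns the reduced
    node dictionary and the (possibly different) root.
    """
    canonical = {}   # uid in the input -> uid or leaf that replaces it
    unique = {}      # (label, low, high) -> uid already kept for that triple
    reduced = {}

    def resolve(child):
        return child if isinstance(child, bool) else canonical[child]

    # Children carry larger labels than their parents, so visiting labels in
    # descending order handles every child before any of its parents.
    for uid in sorted(nodes, key=lambda u: u[0], reverse=True):
        low, high = (resolve(c) for c in nodes[uid])
        if low == high:                  # rule (1): skip identical children
            canonical[uid] = low
            continue
        key = (uid[0], low, high)
        if key not in unique:            # rule (2): keep one copy only
            unique[key] = uid
            reduced[uid] = (low, high)
        canonical[uid] = unique[key]
    return reduced, canonical[root]


# Hand-made unreduced OBDD for x2 => (x0 and x1), used only as an example.
unreduced = {
    (0, 0): ((2, 0), (1, 0)),
    (1, 0): ((2, 0), (2, 1)),
    (2, 0): (True, False),
    (2, 1): (True, True),
}
reduced, root = reduce_obdd(unreduced, (0, 0))
```

In this example the node (2, 1) has two identical children and is replaced by the true leaf, so three nodes remain.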

The two primary algorithms for BDD manipulation are called Apply and 
Reduce. The Apply computes the OBDD $h = f \odot g$ where $f$ and $g$ are 
OBDDs and $\odot$ is a function $\mathbb{B} \times \mathbb{B} \to \mathbb{B}$. 
This is essentially done by recursively computing the product construction of 
the two BDDs $f$ and $g$ and applying $\odot$ when recursing to pairs of 
leaves. The Reduce applies the two reduction rules on an OBDD bottom-up to 
obtain the corresponding ROBDD [10]. 

Common implementations of BDDs use recursive depth-first procedures that 
traverse the BDD and the unique nodes are managed through a hash table 
[9,16,20,23,34]. The latter allows one to directly incorporate the Reduce 
algorithm of [10] within each node lookup [9,27]. They also use a memoisation 
table to minimise the number of duplicate computations [16,23,34]. If the 
sizes $N_f$ and $N_g$ of two BDDs are considerably larger than the memory $M$ 
available, each recursion request of the Apply algorithm will in the worst 
case result in an I/O, caused by looking up a node within the memoisation 
table and following the low and high arcs [6,21]. Since there are up to 
$N_f \cdot N_g$ recursion requests, this results in up to 
$O(N_f \cdot N_g)$ I/Os in the worst case. The Reduce operation transparently 
built into the unique node table with a find-or-insert function can also cause 
an I/O for each lookup within this table [21]. This adds yet another $O(N)$ 
I/Os, where $N$ is the number of nodes in the unreduced BDD. 
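
The conventional design described above can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions and not the code of any of the cited packages: the class Manager, its methods and the function evaluate are hypothetical names. Each dictionary access in make_node and apply stands for one of the hash-table lookups that may cost an I/O once the tables no longer fit in main memory.

```python
import math


class Manager:
    """Unique node table of a conventional in-memory BDD package (sketch)."""

    def __init__(self):
        self.table = {}   # (label, low, high) -> node handle
        self.nodes = []   # handle index -> (label, low, high)

    def make_node(self, label, low, high):
        if low == high:                    # reduction rule (1)
            return low
        key = (label, low, high)
        if key not in self.table:          # reduction rule (2): find-or-insert
            self.table[key] = ("node", len(self.nodes))
            self.nodes.append(key)
        return self.table[key]

    def unpack(self, f):
        """Return (label, low, high); a leaf is given the label infinity."""
        if isinstance(f, bool):
            return math.inf, f, f
        return self.nodes[f[1]]


def apply(mgr, op, f, g, memo=None):
    """Recursive depth-first Apply with a memoisation table."""
    memo = {} if memo is None else memo
    if isinstance(f, bool) and isinstance(g, bool):
        return op(f, g)
    if (f, g) not in memo:                 # memoisation table lookup
        f_label, f_low, f_high = mgr.unpack(f)
        g_label, g_low, g_high = mgr.unpack(g)
        label = min(f_label, g_label)
        if f_label != label:               # f is below this level: keep it
            f_low = f_high = f
        if g_label != label:               # g is below this level: keep it
            g_low = g_high = g
        memo[(f, g)] = mgr.make_node(
            label,
            apply(mgr, op, f_low, g_low, memo),
            apply(mgr, op, f_high, g_high, memo),
        )
    return memo[(f, g)]


def evaluate(mgr, f, assignment):
    """Follow low/high arcs according to assignment[label] until a leaf."""
    while not isinstance(f, bool):
        label, low, high = mgr.unpack(f)
        f = high if assignment[label] else low
    return f


mgr = Manager()
x2 = mgr.make_node(2, False, True)
x1 = mgr.make_node(1, False, True)
x0_and_x1 = mgr.make_node(0, False, x1)
h = apply(mgr, lambda a, b: (not a) or b, x2, x0_and_x1)   # x2 => (x0 and x1)
```

Because make_node applies both reduction rules on every call, the result h is already reduced; this is the transparent Reduce mentioned in the text.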

Lars Arge provided in [5,6] a description of an Apply algorithm that is capable 
of only using $O(\mathrm{sort}(N_f \cdot N_g))$ I/Os and a Reduce that uses 
$O(\mathrm{sort}(N))$ I/Os (see [6] for a detailed description). He also 
proved this to be optimal for both algorithms, assuming a levelwise ordering 
of nodes on disk [6]. Our algorithms, implemented in Adiar, differ from Arge's 
in subtle non-trivial ways. We will not elaborate further on his original 
proposal, since our algorithms are simpler and better at conveying the 
time-forward processing technique he used. Instead, we will mention where our 
Reduce and Apply algorithms differ from his. 


3 BDD Manipulation by Time-forward Processing 


Our algorithms exploit the total and topological ordering of the internal nodes 
in the BDD depicted in (1) below, where parents precede their children. It is 
topological by ordering a node by its label, $i : \mathbb{N}$, and total by 
secondly ordering on a node's identifier, $id : \mathbb{N}$. This identifier 
only needs to be unique on each level as nodes are still uniquely identifiable 
by the combination of their label and identifier. 

$$(i_1, id_1) < (i_2, id_2) \;\equiv\; i_1 < i_2 \lor (i_1 = i_2 \land id_1 < id_2) \qquad (1)$$


We write the unique identifier $(i, id) : \mathbb{N} \times \mathbb{N}$ for a node as $x_{i,id}$. 
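
The ordering (1) is the lexicographic order on pairs. As a small Python illustration, where the function name precedes and the sample identifiers are our own choices, the built-in tuple comparison agrees with it:

```python
def precedes(a, b):
    """Ordering (1) on unique identifiers given as pairs (label, id)."""
    (i1, id1), (i2, id2) = a, b
    return i1 < i2 or (i1 == i2 and id1 < id2)


uids = [(1, 1), (0, 0), (2, 0), (1, 0)]
in_order = sorted(uids)   # Python compares tuples lexicographically
```

Sorting a list of such pairs therefore places every node after all nodes on earlier levels, as the topological ordering requires.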

BDD nodes do not contain an explicit pointer to their children but instead 
the children's unique identifier. Following the same notion, leaf values are 
stored directly in the leaf's parents. This makes a node a triple 
$(uid, low, high)$ where $uid : \mathbb{N} \times \mathbb{N}$ is its unique 
identifier and $low$ and $high : (\mathbb{N} \times \mathbb{N}) + \mathbb{B}$ 
are its children. The ordering in (1) is lifted to compare the uids of two 
nodes, and so a BDD is represented by a file with BDD nodes in sorted order. 
For example, the BDDs in Fig. 1 would be represented as the lists depicted in 
Fig. 2. 

The Apply algorithm in [6] produces an unreduced OBDD, which is turned 
into an ROBDD with Reduce. The original algorithms of Arge solely work on a 
node-based representation. Arge briefly notes that with an arc-based represen- 
tation, the Apply algorithm is able to output its arcs in the order needed by the 
following Reduce, and vice versa. Here, an arc is a triple 
$(source, is\_high, target)$ (written as 
$source \xrightarrow{is\_high} target$) where 
$source : \mathbb{N} \times \mathbb{N}$, $is\_high : \mathbb{B}$, and 
$target : (\mathbb{N} \times \mathbb{N}) + \mathbb{B}$, i.e. source and target contain the level and identifier of internal 
nodes. We have further pursued this idea of an arc-based representation and can 
conclude that the algorithms indeed become simpler and more efficient with an 
arc-based output from Apply. On the other hand, we see no such benefit over 
the more compact node-based representation in the case of Reduce. Hence as 
is depicted in Fig. 3, our algorithms work in tandem by cycling between the 
node-based and arc-based representation. 


1a: [ $(x_{2,0}, \bot, \top)$ ] 

1b: [ $(x_{0,0}, \bot, x_{1,0})$ , $(x_{1,0}, \bot, \top)$ ] 

1c: [ $(x_{0,0}, x_{1,0}, x_{1,1})$ , $(x_{1,0}, \bot, \top)$ , $(x_{1,1}, \top, \bot)$ ] 

1d: [ $(x_{1,0}, x_{2,0}, \top)$ , $(x_{2,0}, \bot, \top)$ ] 


Fig. 2: In-order representation of BDDs of Fig. 1 
Fig. 3: The Apply-Reduce pipeline of our proposed algorithms. (Diagram: Apply 
takes the nodes of $f$ and $g$ and outputs the internal arcs and leaf arcs of 
$f \odot g$, which Reduce turns into the nodes of $f \odot g$.) 


Fig. 4: Unreduced output of Apply when computing 
$x_2 \Rightarrow (x_0 \land x_1)$. (a) Semi-transposed graph (pairs indicate 
nodes in Fig. 1a and 1b, respectively). (b) In-order arc-based representation, 
split into internal arcs and leaf arcs. 


Notice that our Apply outputs two files containing arcs: arcs to internal 
nodes (blue) and arcs to leaves (red). Internal arcs are output at the time their 
targets are processed, and since nodes are processed in ascending order, internal 
arcs end up being sorted with respect to the unique identifier of their target. 
This groups all in-going arcs to the same node together and effectively reverses 
internal arcs. Arcs to leaves, on the other hand, are output when their source is 
processed, which groups all out-going arcs to leaves together. These two outputs 
of Apply represent a semi-transposed graph, which is exactly of the form needed 
by the following Reduce. For example, the Apply on the node-based ROBDDs in 
Fig. 1a and 1b with logical implication as the operator will yield the arc-based 
unreduced OBDD depicted in Fig. 4. 
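
The shape of the two output files can be illustrated with a short Python sketch. It derives the semi-transposed form from a list of nodes by sorting, whereas Apply itself obtains this order as a by-product of its sweep. The function name to_semi_transposed, the example node list (worked out by hand for the implication example) and the order among arcs that share a target are assumptions made for illustration.

```python
def to_semi_transposed(nodes):
    """Split (uid, low, high) nodes into internal arcs and leaf arcs.

    An arc is a triple (source, is_high, target). Internal arcs are sorted by
    target, which groups all in-going arcs of a node; leaf arcs are sorted by
    source, which groups all out-going arcs to leaves.
    """
    internal, leaf = [], []
    for uid, low, high in nodes:
        for is_high, target in ((False, low), (True, high)):
            arc = (uid, is_high, target)
            (leaf if isinstance(target, bool) else internal).append(arc)
    internal.sort(key=lambda arc: (arc[2], arc[0], arc[1]))
    leaf.sort(key=lambda arc: (arc[0], arc[1]))
    return internal, leaf


# Unreduced OBDD for x2 => (x0 and x1); leaves are Python booleans.
nodes = [
    ((0, 0), (2, 0), (1, 0)),
    ((1, 0), (2, 0), (2, 1)),
    ((2, 0), True, False),
    ((2, 1), True, True),
]
internal_arcs, leaf_arcs = to_semi_transposed(nodes)
```

Reading internal_arcs from the end towards the front visits the targets bottom-up, which is the direction in which a subsequent Reduce processes the graph.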

For simplicity, we will ignore any cases of leaf-only BDDs in our presentation 
of the algorithms. They are easily extended to also deal with those cases. 


3.1 Apply 


Our Apply algorithm works by a single top-down sweep through the input DAGs. 
Internal arcs are reversed due to this top-down nature, since an arc between two 
internal nodes can first be resolved and output at the time of the arc's target. 
These arcs are placed in the file $F_{internal}$. Arcs from nodes to leaves are 
placed in the file $F_{leaf}$. 

The algorithm itself essentially works like the standard Apply algorithm. 
Given a recursion request for a pair of input nodes $v_f$ from $f$ and $v_g$ 
from $g$, a single node is created with label 
$\min(v_f.uid.label, v_g.uid.label)$ and recursion requests $r_{low}$ and 
$r_{high}$ are created for its two children. If the labels of $v_f.uid$ and 
$v_g.uid$ are equal, then $r_{low} = (v_f.low, v_g.low)$ and 
$r_{high} = (v_f.high, v_g.high)$. Otherwise, $r_{low}$, resp. $r_{high}$, 
contains the uid of the low child, resp. the high child, of $\min(v_f, v_g)$, 
whereas $\max(v_f.uid, v_g.uid)$ is kept as is. 

Apply(f, g, ⊙) 
 1  F_internal ← []; F_leaf ← []; Q_app:1 ← ∅; Q_app:2 ← ∅ 
 2  v_f ← f.next(); v_g ← g.next() 
 3  id ← 0; label ← undefined 
 4 
 5  /* Insert request for root (v_f, v_g) */ 
 6  Q_app:1.push(NIL --undefined--> (v_f.uid, v_g.uid)) 
 7 
 8  /* Process requests in topological order */ 
 9  while Q_app:1 ≠ ∅ ∨ Q_app:2 ≠ ∅ do 
10    (s --is_high--> (t_f, t_g), low, high) ← TopOf(Q_app:1, Q_app:2) 
11 
12    t_seek ← if low, high = NIL then min(t_f, t_g) else max(t_f, t_g) 
13    while v_f.uid < t_seek ∧ f.has_next() do v_f ← f.next() od 
14    while v_g.uid < t_seek ∧ g.has_next() do v_g ← g.next() od 
15 
16    if low = NIL ∧ high = NIL ∧ t_f ∉ {⊥, ⊤} ∧ t_g ∉ {⊥, ⊤} 
17       ∧ t_f.label = t_g.label ∧ t_f.id ≠ t_g.id 
18    then /* Forward information of min(t_f, t_g) to max(t_f, t_g) */ 
19      v ← if t_seek = v_f.uid then v_f else v_g 
20      while Q_app:1.top() matches (_ → (t_f, t_g)) do 
21        (s --is_high--> (t_f, t_g)) ← Q_app:1.pop() 
22        Q_app:2.push(s --is_high--> (t_f, t_g), v.low, v.high) 
23      od 
24    else /* Process request (t_f, t_g) */ 
25      id ← if label ≠ t_seek.label then 0 else id + 1 
26      label ← t_seek.label 
27 
28      /* Forward or output out-going arcs */ 
29      r_low, r_high ← RequestsFor((t_f, t_g), v_f, v_g, low, high, ⊙) 
30      (if r_low ∈ {⊥, ⊤} then F_leaf else Q_app:1).push(x_{label,id} --⊥--> r_low) 
31      (if r_high ∈ {⊥, ⊤} then F_leaf else Q_app:1).push(x_{label,id} --⊤--> r_high) 
32 
33      /* Output in-going arcs */ 
34      while Q_app:1 ≠ ∅ ∧ Q_app:1.top() matches (_ → (t_f, t_g)) do 
35        (s --is_high--> (t_f, t_g)) ← Q_app:1.pop() 
36        if s ≠ NIL then F_internal.push(s --is_high--> x_{label,id}) 
37      od 
38      while Q_app:2 ≠ ∅ ∧ Q_app:2.top() matches (_ → (t_f, t_g), _, _) do 
39        (s --is_high--> (t_f, t_g), _, _) ← Q_app:2.pop() 
40        if s ≠ NIL then F_internal.push(s --is_high--> x_{label,id}) 
41      od 
42  od 
43  return F_internal, F_leaf 

Fig. 5: The Apply algorithm 
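
The case distinction for the two child requests described above can be sketched in Python as follows. This is a simplified illustration and not the RequestsFor of Adiar: it leaves out the forwarded low and high values and any pruning, and the names Node, is_leaf and requests_for as well as the parameter list are assumptions made for this sketch. Identifiers are pairs (label, id) and leaves are booleans.

```python
from typing import Callable, NamedTuple, Tuple, Union

Uid = Tuple[int, int]          # (label, id)
Target = Union[Uid, bool]      # internal node or leaf value


class Node(NamedTuple):
    uid: Uid
    low: Target
    high: Target


def is_leaf(t: Target) -> bool:
    return isinstance(t, bool)


def requests_for(t_f: Target, t_g: Target, v_f: Node, v_g: Node,
                 op: Callable[[bool, bool], bool]):
    """Requests (r_low, r_high) for the children of the pair (t_f, t_g).

    v_f and v_g are the nodes currently read from f and g; each one is only
    consulted when its uid is the one named in the request.
    """
    label = min(t[0] for t in (t_f, t_g) if not is_leaf(t))

    def child(t, v, high):
        if is_leaf(t) or t[0] != label:
            return t                       # deeper node or leaf: kept as is
        assert v.uid == t, "the node for t must have been read"
        return v.high if high else v.low

    def resolve(a, b):
        # A pair of leaves is turned into a leaf by applying the operator.
        return op(a, b) if is_leaf(a) and is_leaf(b) else (a, b)

    r_low = resolve(child(t_f, v_f, False), child(t_g, v_g, False))
    r_high = resolve(child(t_f, v_f, True), child(t_g, v_g, True))
    return r_low, r_high


implies = lambda a, b: (not a) or b
x2 = Node((2, 0), False, True)             # BDD for x2
g_root = Node((0, 0), False, (1, 0))       # root of the BDD for x0 and x1
root_requests = requests_for((2, 0), (0, 0), x2, g_root, implies)
```

For the root request of the implication example only the node from g lies on the smaller level, so both child requests keep the node of x2 and descend in g.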

The pseudocode for the Apply procedure is shown in Fig. 5, where the 
RequestsFor function computes $r_{low}$ and $r_{high}$ for the pair of nodes 
$(t_f, t_g)$. The goal of the rest of the algorithm is to obtain the 
information that RequestsFor needs in an I/O-efficient way. To this end, the 
two priority queues $Q_{app:1}$ and $Q_{app:2}$ are used to synchronise 
recursion requests for a pair of nodes $(t_f, t_g)$ with the sequential order 
of reading nodes in $f$ and $g$. $Q_{app:1}$ has elements of the form 
$(s \xrightarrow{is\_high} (t_f, t_g))$ and $Q_{app:2}$ has elements 
$(s \xrightarrow{is\_high} (t_f, t_g), low, high)$. The boolean $is\_high$ and 
the unique identifier $s$, being the request's origin, are used on lines 
33–41 to output all ingoing arcs when the request is resolved. 

Elements in $Q_{app:1}$ are sorted in ascending order by $\min(t_f, t_g)$, 
i.e. the node encountered first from $f$ and $g$. Requests to the same 
$(t_f, t_g)$ are grouped together by secondarily sorting the tuple 
lexicographically. $Q_{app:2}$ is sorted in ascending order by 
$\max(t_f, t_g)$, i.e. the second of the two to be visited, and ties are again 
broken lexicographically. This second priority queue is used in the case where 
$t_f.label = t_g.label$ but $t_f.id \neq t_g.id$, i.e. when both are needed to 
resolve the request but they are not necessarily available at the same time. 
To this end, the given request is moved from $Q_{app:1}$ into $Q_{app:2}$ on 
lines 19–23. Here, the request is extended with the unique identifiers $low$ 
and $high$ of $\min(v_f, v_g)$, which makes the children of $\min(v_f, v_g)$ 
available at $\max(v_f, v_g)$. 

The next request to process from $Q_{app:1}$ or $Q_{app:2}$ is dictated by the 
TopOf function on line 10. In the case that both $Q_{app:1}$ and $Q_{app:2}$ 
are non-empty, let 
$r_1 = (s_1 \xrightarrow{is\_high_1} (t_{f:1}, t_{g:1}))$ be the top element 
of $Q_{app:1}$ and let the top element of $Q_{app:2}$ be 
$r_2 = (s_2 \xrightarrow{is\_high_2} (t_{f:2}, t_{g:2}), low, high)$. 
TopOf($Q_{app:1}$, $Q_{app:2}$) returns $(r_1, NIL, NIL)$ if 
$\min(t_{f:1}, t_{g:1}) < \max(t_{f:2}, t_{g:2})$ and $r_2$ otherwise. If 
either one is empty, then it equivalently outputs the top request of the other. 
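
The interplay of the two queues can be sketched with Python's heapq module standing in for the I/O-efficient priority queue. The helper names key, push_q1, push_q2 and top_of, the sequence counter used as a tie-breaker, and the choice to order leaves after all internal nodes are assumptions of this sketch rather than details taken from Adiar.

```python
import heapq
import itertools
import math


def key(t):
    """Sort key of a target: leaves (bools) are placed after every uid."""
    return (math.inf, int(t)) if isinstance(t, bool) else t


_seq = itertools.count()   # tie-breaker, so requests themselves never compare


def push_q1(q1, s, is_high, t_f, t_g):
    """Insert a request into the first queue, ordered by min(t_f, t_g)."""
    k_f, k_g = key(t_f), key(t_g)
    entry = (min(k_f, k_g), (k_f, k_g), next(_seq), (s, is_high, t_f, t_g))
    heapq.heappush(q1, entry)


def push_q2(q2, s, is_high, t_f, t_g, low, high):
    """Insert a forwarded request into the second queue, ordered by max."""
    k_f, k_g = key(t_f), key(t_g)
    entry = (max(k_f, k_g), (k_f, k_g), next(_seq), (s, is_high, t_f, t_g), low, high)
    heapq.heappush(q2, entry)


def top_of(q1, q2):
    """Peek at the request to process next; one queue must be non-empty."""
    if q1 and (not q2 or q1[0][0] < q2[0][0]):
        return q1[0][3], None, None
    _, _, _, request, low, high = q2[0]
    return request, low, high


q1, q2 = [], []
push_q1(q1, (0, 0), True, (2, 0), (1, 0))                 # keyed by min = (1, 0)
push_q2(q2, (0, 0), False, (1, 0), (1, 1), False, True)   # keyed by max = (1, 1)
first = top_of(q1, q2)
```

Here the request in the first queue is returned because its smaller target (1, 0) precedes the larger target (1, 1) of the forwarded request; when the two keys are equal, the forwarded request from the second queue is returned.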

The arc-based output greatly simplifies the algorithm compared to the original 
proposal of Arge in [6]. Our algorithm only uses two priority queues rather 
than four. Arge's algorithm, like ours, resolves a node before its children, 
but due to the node-based output it has to output this entire node before its 
children. Hence, it has to identify all nodes by the tuple $(t_f, t_g)$, 
doubling the space used. Instead, the arc-based output allows us to output the 
information at the time of the children and hence we are able to generate the 
label and its new identifier for both parent and child. Arge's algorithm also 
did not forward a request's source $s$, so repeated requests to the same pair 
of nodes were merely discarded upon retrieval from the priority queue, since 
they carried no relevant information. Our arc-based output, on the other hand, 
makes every element placed in the priority queue forward the source $s$, vital 
for the creation of the semi-transposed graph. 


Proposition 1 (Following Arge 1996 [6]). The Apply algorithm in Fig. 5 has
I/O complexity $O(\mathrm{sort}(N_f \cdot N_g))$ and $O((N_f \cdot N_g) \cdot \log(N_f \cdot N_g))$ time complexity,
where $N_f$ and $N_g$ are the respective sizes of the BDDs for $f$ and $g$.


See the full paper [33] for the proof. 
Pruning by shortcutting the operator The Apply procedure above, like
Arge's original algorithm, follows recursion requests until a pair of leaves is met.
Yet, for example in Fig. 4 the node for the request (2,9, $\top$) is unnecessary to
resolve, since all leaves of this subgraph trivially will be $\top$ due to the implication
operator. The subsequent Reduce will remove this node and its children in favour
of the $\top$ leaf. Hence, the RequestsFor function can instead immediately create a
request for the leaf. We implemented this in Adiar, since it considerably decreases
the size of $Q_{app:1}$, $Q_{app:2}$, and of the output.
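A minimal sketch of such a shortcut is given below. It assumes leaves are Python booleans and internal nodes are (label, id) tuples; the helper names shortcut and implies are hypothetical and only illustrate the idea of asking whether a leaf operand already determines the result.

```python
def is_leaf(t):
    # Assumption of this sketch: leaves are booleans, internal nodes are tuples.
    return isinstance(t, bool)

def shortcut(op, tf, tg):
    """Return the resulting leaf if op is already decided for (tf, tg), else None."""
    if is_leaf(tf) and is_leaf(tg):
        return op(tf, tg)
    # A leaf operand decides the result if op ignores the other operand.
    if is_leaf(tf) and op(tf, False) == op(tf, True):
        return op(tf, False)
    if is_leaf(tg) and op(False, tg) == op(True, tg):
        return op(False, tg)
    return None  # the request has to be followed further

def implies(a, b):
    return (not a) or b
```

For implication, any request whose right-hand side is the true leaf, or whose left-hand side is the false leaf, is resolved immediately; all other requests are forwarded as before.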


3.2 Reduce 


Our Reduce algorithm in Fig. 6 works like other explicit variants with a single 
bottom-up sweep through the OBDD. Since the nodes are resolved and output 
in a bottom-up descending order, the output is exactly in the reverse order as 
it is needed for any following Apply. We have so far ignored this detail, but the 
only change necessary to the Apply algorithm in Section 3.1 is for it to read the 
list of nodes of f and g in reverse. 

The priority queue $Q_{red}$ is used to forward the reduction result of a node $v$
to its parents in an I/O-efficient way. $Q_{red}$ contains arcs from unresolved sources
$s$ in the given unreduced OBDD to already resolved targets $t'$ in the ROBDD
under construction. The bottom-up traversal corresponds to resolving all nodes
in descending order. Hence, arcs $s \xrightarrow{is\_high} t'$ in $Q_{red}$ are first sorted on $s$ and
secondly on is_high; the latter simplifies retrieving the low and high arcs on lines
8 and 9. The base-cases for the Reduce algorithm are the arcs to leaves in $F_{leaf}$,
which follow the exact same ordering. Hence, on lines 8 and 9, arcs in $Q_{red}$
and $F_{leaf}$ are merged using the PopMax function that retrieves the arc that is
maximal with respect to this ordering.

Since nodes are resolved in descending order, $F_{internal}$ follows this ordering
on the arc's target when elements are read in reverse. The reversal of arcs in
$F_{internal}$ makes the parents of a node $v$, to which the reduction result is to be
forwarded, readily available on lines 26-32.

The algorithm otherwise proceeds similarly to the standard Reduce algorithm
[10]. For each level $j$, all nodes $v$ of that level are created from their high and
low arcs, $e_{high}$ and $e_{low}$, taken out of $Q_{red}$ and $F_{leaf}$. The nodes are split into the
two temporary files $F_{j:1}$ and $F_{j:2}$ that contain the mapping $[uid \mapsto uid']$ from a
node in the given unreduced OBDD to its equivalent node in the output. $F_{j:1}$
contains the nodes $v$ removed due to the first reduction rule and is populated
on lines 7-12: if both children of $v$ are the same then $[v.uid \mapsto v.low]$ is pushed
to this file. $F_{j:2}$ contains the mappings for the second rule and is populated on
lines 15-24. Nodes not placed in $F_{j:1}$ are placed in an intermediate file $F_j$
and sorted by their children. This makes duplicate nodes immediate successors.
Every unique node encountered in $F_j$ is output to $F_{out}$ before mapping itself
and all its duplicates to it in $F_{j:2}$. Since nodes are output out-of-order compared
to the input and it is unknown how many will be output for said level, they
are given new decreasing identifiers starting from the maximal possible value
MAX_ID. Finally, $F_{j:2}$ is sorted back in order of $F_{internal}$ to forward the results
 1  Reduce(F_internal, F_leaf)
 2    F_out ← []; Q_red ← ∅
 3    while Q_red ≠ ∅ do
 4      j ← Q_red.top().source.label; id ← MAX_ID;
 5      F_j ← []; F_j:1 ← []; F_j:2 ← []
 6
 7      while Q_red.top().source.label = j do
 8        e_high ← PopMax(Q_red, F_leaf)
 9        e_low ← PopMax(Q_red, F_leaf)
10        if e_high.target = e_low.target
11        then F_j:1.push([e_low.source ↦ e_low.target])
12        else F_j.push((e_low.source, e_low.target, e_high.target))
13      od
14
15      sort v ∈ F_j by v.low and secondly by v.high
16      v' ← undefined
17      for each v ∈ F_j do
18        if v' is undefined or v.low ≠ v'.low or v.high ≠ v'.high
19        then
20          id ← id − 1
21          v' ← (x_{j,id}, v.low, v.high)
22          F_out.push(v')
23        F_j:2.push([v.uid ↦ v'.uid])
24      od
25
26      sort [uid ↦ uid'] ∈ F_j:2 by uid in descending order
27      for each [uid ↦ uid'] ∈ MergeMaxUid(F_j:1, F_j:2) do
28        while arcs from F_internal.peek() matches _ → uid do
29          (s --is_high--> uid) ← F_internal.next()
30          Q_red.push(s --is_high--> uid')
31        od
32      od
33    od
34    return F_out


Fig.6: The Reduce algorithm 


in both $F_{j:1}$ and $F_{j:2}$ to their parents on lines 26-32. Here, MergeMaxUid
merges the mappings $[uid \mapsto uid']$ in $F_{j:1}$ and $F_{j:2}$ by always taking the mapping
with the largest uid from either file.
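The per-level work of Reduce can be sketched in memory as follows. This is an illustration under simplifying assumptions rather than the external-memory algorithm: a level is given as a plain list of (uid, low, high) triples whose children are already reduced, leaves are encoded as (infinity, 0) and (infinity, 1) so that they sort after all internal identifiers, and the names MAX_ID, BOT, TOP and reduce_level are hypothetical.

```python
MAX_ID = 2**16 - 1               # hypothetical maximal identifier
LEAF = float("inf")              # leaves sort after every internal node
BOT, TOP = (LEAF, 0), (LEAF, 1)  # encoding of the two leaves in this sketch

def reduce_level(j, nodes):
    """Apply both reduction rules to the nodes (uid, low, high) of level j.

    Returns (out, mapping): the nodes to output and the map uid -> uid'.
    """
    mapping, kept = {}, []
    for uid, low, high in nodes:
        if low == high:
            mapping[uid] = low            # rule 1: both children are the same
        else:
            kept.append((uid, low, high))

    # Sorting by the children makes duplicates immediate successors.
    kept.sort(key=lambda v: (v[1], v[2]))

    out, prev, new_id = [], None, MAX_ID
    for uid, low, high in kept:
        if prev is None or (low, high) != (prev[1], prev[2]):
            new_id -= 1                   # new decreasing identifier
            prev = ((j, new_id), low, high)
            out.append(prev)
        mapping[uid] = prev[0]            # rule 2: map duplicates to one node
    return out, mapping
```

The returned mapping plays the role of the two temporary mapping files; forwarding it to the parents through the priority queue is left out of the sketch.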

Since the original algorithm of Arge in [6] takes a node-based OBDD as 
an input and internally uses node-based auxiliary data structures, his Reduce 
algorithm had to create two copies of the input to reverse all internal arcs: a 
copy sorted by the nodes’ low child and one sorted by their high children. Since 
Finternal already has its arcs reversed, our design eliminates two expensive sorting 
steps and more than halves the memory used. 
Another consequence of Arge's node-based representation is that his algo-
rithm had to move all arcs to leaves into $Q_{red}$ rather than merging requests from
$Q_{red}$ with the base-cases from $F_{leaf}$. The semi-transposed input allows us to de-
crease the number of I/Os due to $Q_{red}$ by $O(\mathrm{sort}(N_\ell))$, where $N_\ell$ is the number
of arcs to leaves (see [33] for the proof). In practice, together with pruning the
recursion during Apply, this can provide up to a factor 2 speedup [33].


Proposition 2 (Following Arge 1996 [6]). The Reduce algorithm in Fig. 6 
has an O(sort(N)) I/O complexity and an O(N log N) time complexity. 


See the full paper [33] for the proof. Arge proved in [6] that this O(sort(N)) 
I/O complexity is optimal for the input, assuming a levelwise ordering of nodes. 


3.3 Other BDD Algorithms 


By applying the above algorithmic techniques, one can obtain all other singly- 
recursive BDD algorithms; see [33] for the details. We now design asymptotically 
better variants of Negation and Equality Checking than what is possible by 
deriving them using Apply. 


Negation A BDD is negated by inverting the value in its nodes’ leaf children. 
This is an O(1) I/O-operation if a negation flag is used to mark whether the 
nodes should be negated on-the-fly as they are read from the stream. 


Proposition 3. Negation has I/O, space, and time complexity O(1). 


This is an improvement over the O(sort(N)) I/Os spent by Apply to compute
$f \oplus \top$, where $\oplus$ is exclusive or. Furthermore, disk space is shared between BDDs.
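The negation flag can be illustrated with the following sketch. It assumes a BDD is a shared, immutable tuple of (uid, low, high) nodes with boolean leaves, standing in for a file on disk; the names BDD, negated and read_nodes are hypothetical and do not mirror Adiar's C++ interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BDD:
    nodes: tuple          # shared node list, standing in for a file on disk
    negate: bool = False  # whether leaf values are flipped when read

def negated(f):
    # Constant work: the nodes are shared and only the flag changes.
    return BDD(f.nodes, not f.negate)

def flip(child, negate):
    # Only leaves (booleans in this sketch) are affected by the flag.
    return (not child) if negate and isinstance(child, bool) else child

def read_nodes(f):
    # Leaves are negated on-the-fly while streaming the nodes.
    for uid, low, high in f.nodes:
        yield (uid, flip(low, f.negate), flip(high, f.negate))
```

Because the original and the negated BDD refer to the same node list, no additional space is used for the negated copy.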


Equality Checking To check for f = g one has to check the DAG of f being 
isomorphic to the one for g [10]. This makes f and g trivially inequivalent when 
the number of nodes, number of levels, or the label or size of each of the L levels 
do not match. This can be checked in O(1) and O(L/B) I/Os if the Reduce 
algorithm in Fig. 6 is made to also output the relevant meta-information. 

If $f \equiv g$, the isomorphism relates the roots of the BDDs for $f$ and $g$. For
any node $v_f$ of $f$ and $v_g$ of $g$, if $(v_f, v_g)$ is uniquely related by the isomorphism,
then so should $(v_f.low, v_g.low)$ and $(v_f.high, v_g.high)$. Hence, one can check for
equality by traversing the product of both BDDs (as in Apply) and check for
one of the following two conditions being violated.


— The children of the given recursion request (tf, tg) should either both be the 
same leaf or an internal node with the same label. 

— On level $i$, exactly $N_i$ unique recursion requests should be retrieved from the
priority queues, where $N_i$ is the number of nodes on level $i$.


If the first condition is never violated, it is guaranteed that $f \equiv g$, and so $\top$
is output. The second ensures that the algorithm terminates earlier on negative
cases and lowers the provable complexity bound; see [33] for the proof.
Proposition 4. Equality Checking has I/O complexity O(sort(N)) and time
complexity O(N log N), where $N = \min(N_f, N_g)$ is the minimum of the respec-
tive sizes of the BDDs for $f$ and $g$.


If (1) on page 5 is extended such that $\bot$, $\top$ succeed all unique identifiers and
$\bot < \top$, then Fig. 6 actually enforces a much stricter ordering; it outputs nodes
in an order purely based on their label and the unique identifier of their children.


Proposition 5. If $G_f$ and $G_g$ are outputs of Reduce in Fig. 6, then $f \equiv g$ if
and only if the $i$th nodes of $G_f$ and $G_g$ match numerically.


See the full paper [33] for the proof. The negation operation breaks this property 
by changing the leaf values without changing their order. So, in the case where f 
or g, but not both, have their negation flag set, one still has to use the O(sort(N)) 
algorithm above, but otherwise a simple linear scan of both BDDs suffices. 


Corollary 1. If the negation flags of the BDDs for $f$ and $g$ are equal, then
Equality Checking can be done in $2 \cdot N/B$ I/Os and $O(N)$ time, where $N =
\min(N_f, N_g)$ is the minimum of the respective sizes of the BDDs for $f$ and $g$.
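The linear scan can be sketched as below, assuming both node lists are outputs of Reduce and are therefore in the canonical order of Proposition 5. The function name is_equal and the list-based representation are hypothetical; the length comparison stands in for the meta-information checks.

```python
def is_equal(f_nodes, f_negated, g_nodes, g_negated):
    """Compare two reduced BDDs node by node (a sketch, not Adiar's code)."""
    if f_negated != g_negated:
        # The canonical order is broken by negation; the product traversal
        # based on the priority queues has to be used instead.
        raise ValueError("negation flags differ")
    if len(f_nodes) != len(g_nodes):
        return False  # cheap check on meta-information
    # A single simultaneous scan; all() stops at the first mismatch.
    return all(a == b for a, b in zip(f_nodes, g_nodes))
```

Since the scan stops at the first mismatch, at most min(N_f, N_g) nodes are read from each of the two inputs.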


Both Proposition 4 and Corollary 1 are an asymptotic improvement on the
$O(\mathrm{sort}(N^2))$ equality checking algorithm that computes $f \leftrightarrow g$ with Apply and
Reduce and then tests whether the output is the $\top$ leaf.


4 Adiar: An Implementation 


The algorithms and data structures described in Section 3 have been imple-
mented in a new BDD package, named Adiar¹,². The most important opera-
tions are shown in Table 1. Interaction with the BDD package is done through
C++ programs that include the <adiar/adiar.h> header file and are built and
linked with CMake. Its two dependencies are the Boost library and the TPIE
library; the latter is included as a submodule of the Adiar repository, leaving it
to CMake to build TPIE and link it to Adiar.

Adiar is initialised with the adiar_init(memory, temp_dir) function, where
memory is the memory (in bytes) dedicated to Adiar and temp_dir is the directory
where temporary files will be placed, e.g. a dedicated hard disk. The BDD package
is deinitialised by calling the adiar_deinit() function.

The bdd object in Adiar is a container for the underlying files for each BDD,
while a __bdd object is used for possibly unreduced arc-based OBDDs. Reference
counting on the underlying files is used to reuse the same files and to immediately
delete them when the reference count decrements to 0. Files are deleted as early
as possible by use of implicit conversions between the bdd and __bdd objects and
an overloaded assignment operator, making the concurrently occupied space on
disk minimal.

1 adiar (Portuguese) (verb): to defer, to postpone
2 Source code is publicly available at github.com/ssoelvsten/adiar
Adiar function         | Operation         | I/O complexity              | Justification
bdd_apply(f, g, ⊙)     | f ⊙ g             | O(sort(N_f · N_g))          | Prop. 1, 2
bdd_ite(f, g, h)       | f ? g : h         | O(sort(N_f · N_g · N_h))    | [33], Prop. 2
bdd_restrict(f, i, v)  | f|_{x_i = v}      | O(sort(N_f))                | [33], Prop. 2
bdd_exists(f, i)       | ∃v : f|_{x_i = v} | O(sort(N_f^2))              | [33], Prop. 2
bdd_forall(f, i)       | ∀v : f|_{x_i = v} | O(sort(N_f^2))              | [33], Prop. 2
bdd_not(f)             | ¬f                | O(1)                        | Prop. 3
bdd_satcount(f)        | #x : f(x)         | O(sort(N_f))                | [33]
bdd_nodecount(f)       | N_f               | O(1)                        | Section 3.3
f == g                 | f ≡ g             | O(sort(min(N_f, N_g)))      | Prop. 4

Table 1: Some of the operations supported by Adiar and their I/O-complexity. 


5 Experimental Evaluation 


While time-forwarding may be an asymptotic improvement over the recursive 
approach in the I/O-model, its usability in practice is another question entirely. 
We have compared Adiar 1.0.1 to the recursive BDD packages CUDD 3.0.0 [34] 
and Sylvan 1.5.0 [16] (in single-core mode). We constructed BDDs for some 
benchmarks in all tools in a similar manner, ensuring the same variable ordering. 

The experimental results³ were obtained on server nodes of the Grendel clus-
ter at the Centre for Scientific Computing Aarhus. Each node has two 48-core 3.0
GHz Intel Xeon Gold 6248R processors, 384 GiB of RAM, 3.5 TiB of available
SSD disk, runs CentOS Linux, and compiles code with GCC 10.1.0. We report the
minimum measured running time, since it minimises any error caused by the
CPU, memory and disk [13]; using the average or median does not significantly
change any of our results. For comparability all compute nodes are set to use
350 GiB of the available RAM, while each BDD package is given 300 GiB of it.
Sylvan was set to not use any parallelisation, given a ratio between the node
table and the cache of 64:1, and set to start its data structures $2^{12}$ times smaller
than the final 262 GiB it may occupy, i.e. at first with a table and cache that
occupy 66 MiB. The size of the CUDD cache was set such that it would have the
same node table to cache ratio when reaching 300 GiB.


5.1 Queens 


The solution to the Queens problem is the number of arrangements of N queens 
on an $N \times N$ board, such that no queen is threatened by another. Our bench-
mark follows the description in [22]: the variable $x_{ij}$ represents whether a queen
is placed on the $i$th row and the $j$th column and the solution to the prob-
lem then corresponds to the number of satisfying assignments to the formula
$\bigwedge_{i=0}^{N-1} \bigvee_{j=0}^{N-1} (x_{ij} \wedge \neg has\_threat(i, j))$, where has_threat(i, j) is true if a queen is
placed on a tile $(k, l)$ that would be in conflict with a queen placed on $(i, j)$.
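For very small boards, the formula can be checked by brute force. The sketch below enumerates all assignments to the N x N variables and counts those satisfying the formula; it does not use BDDs at all, its running time is exponential in the square of N, and the function names are hypothetical.

```python
from itertools import product

def has_threat(x, n, i, j):
    # True if some other tile (k, l) holds a queen that shares a row,
    # a column, or a diagonal with (i, j).
    return any(
        x[k][l]
        for k in range(n)
        for l in range(n)
        if (k, l) != (i, j)
        and (k == i or l == j or abs(k - i) == abs(l - j))
    )

def count_queens(n):
    """Count satisfying assignments of the Queens formula by enumeration."""
    count = 0
    for bits in product((False, True), repeat=n * n):
        x = [bits[r * n:(r + 1) * n] for r in range(n)]
        # Every row must contain a queen that no other queen threatens.
        if all(
            any(x[i][j] and not has_threat(x, n, i, j) for j in range(n))
            for i in range(n)
        ):
            count += 1
    return count
```

Calling count_queens(4) yields 2, the two solutions of the 4-Queens problem; any additional queen would share a row with an existing one and is therefore ruled out by the formula.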


3 Available at Zenodo [32] and at github.com/ssoelvsten/bdd-benchmark 
Fig. 7: Running time solving N-Queens (lower is better).


The ROBDD of the innermost conjunction can be directly constructed, without
any BDD operations. 

The current version of Adiar is implemented purely using external memory 
algorithms. These perform poorly when given small amounts of data. Hence, it 
is not meaningful to compare performance for N < 12 where the BDDs involved
are 23.5 MiB or smaller. For N > 12, Fig. 7 shows how the gap in running time 
between Adiar and other BDD packages shrinks as instances grow. At N = 15, 
which is the largest instance solved by Sylvan and CUDD, Adiar is 1.47 times 
slower than CUDD and 2.15 times slower than Sylvan. 

The largest instance solved by Adiar is N = 17 where the largest BDD
constructed is 719 GiB in size. In contrast, Sylvan only constructed a 12.9 GiB
sized BDD for N = 15. Even though Adiar has to use disk, it only becomes
1.8 times slower per processed node compared to its highest performance at 
N = 13. Conversely, Adiar is able to solve the N = 15 problem with much less 
main memory than both Sylvan and CUDD. Fig. 8 shows the running time on 
the same machine with its memory, including its file system cache, limited with 
cgroups to be 1 GiB more than given to the BDD package. Yet, Adiar is only 1.39 
times slower when decreasing its memory down to 2 GiB, while Sylvan cannot 
function with less than 56 GiB of memory available. 


[Figure 8: line plot of running time against Memory (GiB), from 0 to 250 GiB, for Adiar, CUDD, and Sylvan.]

Fig. 8: Running time of 15-Queens with variable memory (lower is better).
We also ran experiments on counting the number of draw positions in a 3D-
version of Tic-Tac-Toe, derived from [22]. Our results [33] paint a similar picture:
Adiar is only 2.50 times slower than Sylvan for Sylvan's largest solved instance; 
Sylvan only creates BDDs of up to 34.4 GiB in size, whereas Adiar constructs 
a 902 GiB sized BDD; Adiar only slows down by a factor of 2.49 per processed 
node when using the disk extensively to solve the larger instances. 


5.2 Combinatorial Circuit Verification 


The EPFL Combinational Benchmark Suite [2] consists of 23 combinatorial cir- 
cuits designed for logic optimisation and synthesis. 20 of these are split into the 
two categories random/control and arithmetic, and each of these original cir-
cuits $C_o$ is distributed together with one circuit optimised for size $C_s$ and one
circuit optimised for depth $C_d$. The last three are the More than a Million Gates
benchmarks, which we will ignore as they come without optimised versions. 

Based on the approach of the Nanotrav program as distributed with CUDD, 
we verify the functional equivalence between each output gate of $C_o$ and the
corresponding gate in each of the optimised circuits $C_d$ and $C_s$. The BDDs are com-
puted by representing every input gate by a decision variable, and computing the
BDD of all other gates from the BDDs of their input wires. Finally, the BDDs 
for every pair of corresponding output gates are tested for equality. Memoisation 
ensures that the same gate is not computed twice, while a reference counter is 
maintained for each gate such that dead BDDs in the memoisation table may 
be garbage collected. Recall that Adiar stores each BDD in a separate file, while 
Sylvan and CUDD share nodes between different BDDs in a forest. 

Table 2 shows the number of verified instances with each BDD package within 
a 15-day time limit. Adiar is able to verify three more benchmarks than both
other BDD packages. This is despite the fact that most instances include hun- 
dreds of concurrent BDDs, while the disk is only 12 times larger than main 
memory. For example, the largest verified benchmark, mem_ctrl, has up to 1231 
BDDs existing at the same time. 

Table 3 shows the time it took Adiar to verify equality between the original 
and each of the optimised circuits, for the three largest cases verified. The table 
also shows the sum of the sizes of the output BDDs that represent each circuit. 
Throughout all solved benchmarks, equality checking took less than 1.47% of
the total construction time and the O(N/B) algorithm could be used in 71.6%
of all BDD comparisons. The voter benchmark with its single output shows that 


# solved | # out-of-space | # time-out
Adiar 23 6 11 
CUDD 20 19 1 
Sylvan 20 13 


Table 2: Number of verified arithmetic and random/control circuits from [2] 
             (a) mem_ctrl       (b) sin          (c) voter
             depth    size      depth   size     depth   size
Time (s)     5862     5868      3.89    3.27     0.058   0.006
O(sort(N))   496      476       22      22       1       0
O(N/B)       735      755       3       3        0       1
N (MiB)      614313             3580             5.74


Table 3: Running time for equivalence testing. O(sort(N)) and O(N/B) are the
number of times the respective algorithm in Section 3.3 was used.


the O(N/B) algorithm is about 10 times faster than the O(sort(N)) algorithm
and can compare at least 2 · 5.75 MiB / 0.006 s = 1.86 GiB/s.


6 Conclusions and Future Work 


Adiar provides an I/O-efficient implementation of BDDs. The iterative BDD 
algorithms exploit a topological ordering of the BDD nodes in external memory, 
by use of priority queues and sorting algorithms. All recursion requests for a 
single node are processed together, eliminating the need for a memoisation table. 
The performance of Adiar is very promising in practice for instances larger 
than a few hundred MiB. As the size of the BDDs increases, the performance of
Adiar gets closer to conventional recursive BDD implementations: for BDDs
larger than a few GiB the use of Adiar has at most resulted in a 3.69 factor
slowdown. Simultaneously, the design of our algorithms allows us to compute on
BDDs that outgrow main memory with only a 2.49 factor slowdown, which is
negligible compared to use of swap memory with conventional BDD packages.
This performance comes at the cost of Adiar not being able to share nodes 
between BDDs. Yet, this increase in space usage is not a problem in practice 
and it makes garbage collection a trivial and cheap deletion of files on disk. On 
the other hand, the lack of sharing makes it impossible to check for functional 
equivalence with a mere pointer comparison. Instead, one has to explicitly check 
for the two DAGs being isomorphic. We have improved the asymptotic and
practical performance of equality checking such that it is negligible in practice. 
This lays the foundation on which we intend to develop external memory ver- 
sions of the BDD algorithms that are still missing for symbolic model checking. 
Specifically, we intend to improve the performance of quantifying multiple vari-
ables and to design a relational product operation. Furthermore, we will improve
performance for small instances that fit entirely into internal memory.
Abstract. In this paper, we present Forest GUMP (for Generalized,
Unifying Merge Process) a tool for providing tangible experience with 
three concepts of explanation. Besides the well-known model explanation 
and outcome explanation, Forest GUMP also supports class character- 
ization, i.e., the precise characterization of all samples with the same 
classification. Key technology to achieve these results is algebraic aggre- 
gation, i.e., the transformation of a Random Forest into a semantically 
equivalent, concise white-box representation in terms of Algebraic Deci- 
sion Diagrams (ADDs). The paper sketches the method and illustrates 
the use of Forest GUMP along an illustrative example taken from the 
literature. This way readers should acquire an intuition about the tool, 
and the way how it should be used to increase the understanding not 
only of the considered dataset, but also of the character of Random 
Forests and the ADD technology, here enriched to comprise infeasible 
path elimination. 


Keywords: Random Forest, Binary/Algebraic Decision Diagram, Ag- 
gregation, Infeasible Paths, Explainability, Random Seed 


1 Introduction 


Random Forests are one of the most widely known classifiers in machine learn- 
ing [3,17]. The method is easy to understand and implement, and at the same 
time achieves impressive classification accuracies in many applications. Com- 
pared to other methods, Random Forests are fast to train and they are clearly 
more suitable for smaller datasets. In contrast to a single decision tree, Random 
Forests, a collection of many trees, do not overfit as easily on a dataset and their 
variance decreases with their size. On the other hand, Random Forests are con- 
sidered black-box models because of their highly parallel nature: following the 
execution of Random Forests means, in particular, following the execution in all 
the involved trees. Such black-box executions are hard to explain to a human 
user even for very small examples. 

In contrast, decision trees are considered white-box models because of their 
sequential evaluation nature. Even if a tree is large in size, a human can easily 
follow its computation step by step by evaluating (simple) decisions at each node 
from the root to a leaf. Indeed, the set of decisions along such an execution path 
precisely explains why a certain choice has been taken. 
Popular methods towards explainability try to establish some user intuition.
For example, they may hint at the most influential input data, like highlighting 
or framing the area of a picture where a face has been identified. Such informa- 
tion is very helpful, and it helps in particular to reveal some of the “popular” 
drastic mismatches incurred by neural networks: if the framed area of the image 
does not contain the “tagged” object, the identification is clearly questionable. 
However, even in a correct classification, the tag by itself gives no reason why 
the identification is indeed correct. 

More ambitious are methods that try to turn black-box models into white-
box models, ideally preserving the semantics of the classification function. For
Random Forests this has been achieved for the first time in [10,14] using the 'ag-
gregating power' of Algebraic Decision Diagrams (ADDs) and Binary Decision
Diagrams (BDDs). ADDs are essentially decision trees whose leaves are labelled
with elements of some algebra, whereas BDDs are the special case for the al-
gebra of Boolean values. Lifting the algebraic operations from the leaves to the
entire ADDs/BDDs allows one to aggregate entire Random Forests into single 
semantically equivalent ADDs, the precondition for solving three explainability 
problems: 


— The Model Explanation Problem [15], i.e. the problem of making the model
as a whole interpretable, is solved in terms of an ADD that specifies pre- 
cisely the same classification function as the original Random Forest (cf. 
Section 6.2). 

— The Class Characterization Problem, i.e. the problem, given a class c, char- 
acterizing the set of all samples that are classified by the Random Forest as 
c. This problem is solved in terms of a BDD which precisely characterizes 
this set of samples (cf. Section 6.3). 

— The Outcome Explanation Problem [15], i.e. the problem of explaining a con- 
crete classification, is solved in terms of a minimal conjunction of (negated) 
decisions that are sufficient to guide the sample into the considered class (cf. 
Section 6.4). 


In this paper, we present Forest GUMP (for Generalized, Unifying Merge Pro- 
cess) a tool for providing a tangible experience with the described concepts of 
explanation. Experimentation with Forest GUMP does not only yield semanti- 
cally equivalent, concise white-box representations for a given Random Forest 
which reveal characteristics of the underlying datasets, but it also allows one to 
experience, e.g., the impact of random seeds on both the quality of prediction 
and the size of the explaining models (cf. Section 6). Our implementation relies 
on the standard Random Forest implementation in Weka [28] and on the ADD 
implementation of the ADD-Lib [9,12,26]. For a more detailed description of the 
transformations and a quantitative analysis we refer the reader to [10,11,14]. 


Related Work: Various methods for making Random Forests interpretable exist 
such as extracting decision rules from the considered black-box model [6], meth- 
ods that are agnostic to the black-box model under consideration [20,24] or by 
deriving a single decision tree from the black-box model [5,7,16,27,29]. In this 
context, single decision trees are considered key to a solution of both, the model
explanation and outcome explanation problem. State of the art solutions to de- 
rive a single decision tree from a Random Forest are approximative [5,7,16,27,29]. 
Thus, their derived explanations are not fully faithful to the original semantics 
of the considered Random Forest. This is in contrast to our ADD-based aggre- 
gation, which precisely reflects the semantics of the original Random Forest. 


After a short introduction to Random Forests in Section 2, we present our ap- 
proach to their aggregation in Section 3 which is followed by an elimination 
of redundant predicates from the decision diagrams in Section 4 and a non- 
compositional abstraction in Section 5. Section 6 introduces Forest GUMP and 
solutions to the three explainability problems. In the end, we summarize the 
lessons we have learned using Forest GUMP in Section 7 which is followed by a 
conclusion and direction to future work in Section 8. 


2 Random Forests 


Learning Random Forests is a quite popular, and algorithmically relatively sim- 
ple classification technique that yields good results for many real-world appli- 
cations. Its decision model generalises a training dataset that holds examples 
of input data labelled with the desired output, also called class. As its name 
suggests, an ensemble of decision trees constitutes a Random Forest. Each of 
these trees is itself a classifier that was learned from a random sample of the 
training dataset. Consequently, all trees are different in structure, they represent 
different decision functions, and can yield different decisions for the very same 
input data. 

To apply a Random Forest to previously unseen input data, every decision
tree is evaluated separately: Tracing the trees from their root down to one of the 
leaves yields one decision per tree, i.e. the predicted class. The overall decision 
of the Random Forest is then derived as the most frequently chosen class, an 
aggregation commonly referred to as majority vote. The key advantage of this 
approach is, compared to single decision trees, the reduced variance. A detailed 
introduction to Random Forests, decision trees, and their learning procedures 
can be found in [3,17,23]. 
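The evaluation just described can be sketched as follows. The tree encoding is an assumption of this sketch: a tree is either a class label (a string) or a tuple (feature, threshold, subtree if less, subtree otherwise). It is not Weka's representation, and the toy trees in the usage example are hypothetical rather than the forest of Figure 1.

```python
from collections import Counter

def classify(tree, sample):
    # Trace the tree from its root down to a leaf.
    while not isinstance(tree, str):
        feature, threshold, if_less, otherwise = tree
        tree = if_less if sample[feature] < threshold else otherwise
    return tree

def majority_vote(forest, sample):
    # One decision per tree; the most frequently chosen class wins.
    # On a tie, Counter.most_common returns the class counted first.
    votes = Counter(classify(tree, sample) for tree in forest)
    return votes.most_common(1)[0][0]
```

For instance, with the three toy trees

```python
t1 = ("petallength", 2.6, "Iris-setosa",
      ("petalwidth", 1.65, "Iris-versicolor", "Iris-virginica"))
t2 = ("petallength", 2.45, "Iris-setosa", "Iris-virginica")
t3 = ("petalwidth", 1.0, "Iris-setosa", "Iris-versicolor")
```

the sample with petallength 4.0 and petalwidth 1.3 receives two votes for Iris-versicolor and one for Iris-virginica.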

In this paper, we use Weka [28] as our reference implementation of Random 
Forests. However, our approach does not depend on implementation details and 
can be easily adapted to other implementations. 

Figure 1 shows a small Random Forest that was learned from the popular
Iris dataset [8]. The dataset lists dimensions of Iris flowers' sepals and petals
for three different species. Using this forest to decide the species on the basis
of given measurements requires first evaluating the three trees individually and
subsequently determining the majority vote. This effort clearly grows linearly
with the size of the forest. In the following we use this example to illustrate our 
approach of forest aggregation for explainability. 
[Figure 1: three decision trees (Tree 1, Tree 2, Tree 3) branching on petallength, petalwidth, sepallength, and sepalwidth, with leaves Iris-setosa, Iris-versicolor, and Iris-virginica.]

Fig. 1. Random Forest learned from the Iris dataset [8] (39 nodes).


The key idea behind our approach is to partially evaluate the Random Forests
at construction time which, in particular, eliminates redundancies between the 
individual trees of a Random Forest. E.g., in our accompanying Iris flower exam- 
ple (cf. Fig. 1) the predicate petalwidth < 1.65 is used in all three trees. This 
can easily lead to cases where the same predicate is evaluated many times in the 
classification process. The partial evaluation proposed in this paper transforms 
Random Forests into decision structures where such redundancies are totally 
eliminated. 

An adequate data structure to achieve this goal for binary decisions are 
Binary Decision Diagrams [1,4,19] (BDDs): For a given predicate ordering, they 
constitute a normal form where each predicate is evaluated at most once, and 
only if required to determine the final outcome. 

Algebraic Decision Diagrams (ADDs) [2] generalise BDDs to capture functions
of the type $\mathbb{B}^n \to C$, which are exactly what we need to specify the
semantics of Random Forests for a classification domain C. Moreover, in analogy
to BDDs, which inherit the algebraic structure of their co-domain $\mathbb{B}$, ADDs also
inherit the algebraic structure of their co-domains if available.

We exploit this property during the partial evaluation of Random Forests by 
considering the class vector co-domain (cf. Sect. 3). The aggregation to achieve 
the corresponding optimised decision structures is then a straightforward conse- 
quence of the used ADD technology. 


3 Class Vector Aggregation 


Class vectors faithfully represent the information about how many trees of the 
original Random Forest voted for a certain outcome. Obviously, this informa- 
tion is sufficient to obtain the precise results of a corresponding majority vote. 
Formally, the domain of class vectors forms a monoid 


$$V = (\mathbb{N}^{|C|}, +, \mathbf{0})$$


where addition + is defined component-wise and $\mathbf{0}$ is the neutral element.
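
As a minimal sketch of this monoid (our illustration, not code from Forest GUMP), class vectors can be modelled as integer tuples. The class order in `CLASSES` and the helper names `zero`, `unit` and `add` are assumptions made for this example only.

```python
from functools import reduce

# Assumed class order for the Iris example; any fixed order of C works.
CLASSES = ("Iris-setosa", "Iris-versicolor", "Iris-virginica")

def zero():
    """Neutral element: the class vector of the empty forest."""
    return (0,) * len(CLASSES)

def unit(label):
    """Class vector of a single tree that votes for `label`."""
    return tuple(1 if c == label else 0 for c in CLASSES)

def add(v, w):
    """Component-wise addition, the monoid operation of V."""
    return tuple(a + b for a, b in zip(v, w))

# Three hypothetical tree outcomes for one sample.
votes = ["Iris-setosa", "Iris-virginica", "Iris-virginica"]
total = reduce(add, (unit(label) for label in votes), zero())
print(total)  # (1, 0, 2)
```

The resulting vector records one vote for the first class and two for the third, which is all a later majority vote needs.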
Fig. 2. Class vector aggregation of the Random Forest (83 nodes).


Fig. 3. Class vector aggregation of the Random Forest without semantically redundant 
nodes (43 nodes). 


With the compositionality of the algebraic structure V and the corresponding
ADDs $D_V$, we can transform any Random Forest incrementally into a semantically
equivalent ADD. Starting with the empty Random Forest, i.e. the neutral
element 0, we consider one tree after the other, aggregating a growing sequence
of decision trees until the entire forest is contained in the new decision diagram.
The details of this transformation are described in [14]. Figure 2 shows the result
of this transformation for our running example.
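
The incremental aggregation can be sketched as follows. This is a naive, unshared variant written for illustration; it is not the ADD-library-based construction of [14]. The `Node` class, the ordering of predicates as Python tuples and the three small trees are assumptions of this sketch. The trees are hypothetical and only loosely modelled on the running example.

```python
from dataclasses import dataclass
from functools import reduce

@dataclass(frozen=True)
class Node:
    pred: tuple    # (feature, threshold) stands for the predicate feature < threshold
    then: object   # sub-diagram taken when the predicate holds
    other: object  # sub-diagram taken when it does not

def add(v, w):
    return tuple(a + b for a, b in zip(v, w))

def make(pred, then, other):
    # Skip a test whose two branches are identical.
    return then if then == other else Node(pred, then, other)

def aggregate(d1, d2):
    """Pointwise sum of two diagrams; predicates are ordered as Python tuples."""
    if not isinstance(d1, Node) and not isinstance(d2, Node):
        return add(d1, d2)  # both operands are class vectors
    top = min(d.pred for d in (d1, d2) if isinstance(d, Node))
    def branch(d, holds):
        if isinstance(d, Node) and d.pred == top:
            return d.then if holds else d.other
        return d
    return make(top,
                aggregate(branch(d1, True), branch(d2, True)),
                aggregate(branch(d1, False), branch(d2, False)))

def evaluate(d, sample):
    while isinstance(d, Node):
        feature, threshold = d.pred
        d = d.then if sample[feature] < threshold else d.other
    return d

SETOSA, VERSICOLOR, VIRGINICA = (1, 0, 0), (0, 1, 0), (0, 0, 1)
# Hypothetical trees; each leaf is the class vector of a single vote.
forest = [
    Node(("petallength", 2.6), SETOSA, Node(("petalwidth", 1.65), VERSICOLOR, VIRGINICA)),
    Node(("petallength", 2.45), SETOSA, Node(("petalwidth", 1.65), VERSICOLOR, VIRGINICA)),
    Node(("petallength", 2.7), SETOSA, Node(("petallength", 4.95), VERSICOLOR, VIRGINICA)),
]
aggregated = reduce(aggregate, forest, (0, 0, 0))  # start from the neutral element
print(evaluate(aggregated, {"petallength": 2.5, "petalwidth": 1.0}))  # (2, 1, 0)
```

Because the aggregation is purely symbolic, the result may contain paths that no sample can follow, for instance a path on which petallength < 2.45 holds while petallength < 2.6 does not.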


4 Infeasible Path Elimination 


The trees of a Random Forest that we aggregate use varying sets of
predicates. In contrast to simple Boolean variables, predicates are not independent
of one another, i.e. the evaluation of one predicate may yield some degree
of knowledge about other predicates. For example, the predicate petallength < 2.45
induces knowledge about other predicates that reason about petallength: When
the petal length is smaller than 2.45 it cannot possibly be greater than or equal to
2.7 at the same time. This is not taken care of by the symbolic treatment of
predicates we followed until now. In fact, predicates are typically considered
independent in the ADD/BDD community.

Infeasible path elimination, as illustrated by the difference between Figure 2
and Figure 3 for our running example, leverages the potential of a semantic
treatment of predicates with significant effect on the size of the resulting ADDs.
In fact, the experiments with thousands of trees reported in [14] would not have
been successful without infeasible path elimination.

Please note that infeasible path elimination 


— is only required after aggregation: The trees in the original Random Forest 
have no infeasible paths by construction. They are introduced in the course 
of our symbolic aggregation, which is insensitive to semantic properties. 

— is compositional and can therefore be applied during the stepwise transfor- 
mation, before the final most frequent label abstraction (cf. Sect. 5), and at 
the very end. 

— does not support normal forms: Whereas class vector abstraction is canon- 
ical for a given variable ordering, infeasible path elimination is not! Thus 
our approach may yield different decision diagrams depending on the order 
of tree aggregation. It is guaranteed, however, that the resulting decision 
diagrams are minimal. 


Infeasible path elimination is a hard problem in general!¹ Our corresponding
implementation uses SMT-solving [21] to eliminate all infeasible paths. An in-depth
discussion of infeasible path elimination is a topic of its own and beyond
the scope of this paper.
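
For the threshold predicates of the running example, the feasibility of a single path can be sketched with plain interval reasoning. Note that this replaces the SMT solver used in the actual implementation by a much simpler check that is only valid for predicates of the form feature < threshold; the function name `is_feasible` and the path encoding are assumptions of this sketch.

```python
import math

def is_feasible(path):
    """path: list of (feature, threshold, holds); holds=True means feature < threshold."""
    lower, upper = {}, {}
    for feature, threshold, holds in path:
        if holds:   # feature < threshold: keep the tightest upper bound
            upper[feature] = min(upper.get(feature, math.inf), threshold)
        else:       # feature >= threshold: keep the tightest lower bound
            lower[feature] = max(lower.get(feature, -math.inf), threshold)
    # A value exists for a feature iff its lower bound is below its upper bound.
    return all(lower.get(f, -math.inf) < upper.get(f, math.inf)
               for f in set(lower) | set(upper))

# petallength < 2.45 together with not (petallength < 2.7) cannot be satisfied.
print(is_feasible([("petallength", 2.45, True), ("petallength", 2.7, False)]))  # False
# petallength < 2.7 together with not (petallength < 2.45) can.
print(is_feasible([("petallength", 2.7, True), ("petallength", 2.45, False)]))  # True
```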


Class vector aggregation and infeasible path elimination are both compositional 
and can therefore be applied in arbitrary order without changing the seman- 
tics. The majority vote at compile time described in the next section is not 
compositional and must therefore be applied at the very end. 


5 Majority Vote at Compile Time 


As mentioned above, maintaining the information about the result of the majority
votes is not compositional. In fact, knowing the result of the majority votes
for two Random Forests gives no clue about the majority vote of the combined
forest. Thus the majority vote abstraction can only be applied at the very end,
after the entire aggregation has been computed compositionally.

The result of the compositional aggregation process, including infeasible path
elimination, is a decision diagram $d \in D_V$ with class vectors in its terminal nodes.
The majority vote abstraction $\Delta_C : D_V \to D_C$ can now be defined as the lifted
version of the majority vote abstraction on class vectors $v \in \mathbb{N}^{|C|}$ (cf. [14]):

$$\delta_C(v) := \arg\max_{c \in C} v_c.$$

Note that $\delta_C$ does not project into the same carrier set but rather from one
algebraic structure V into another C. However, these transformations can be
applied to the corresponding decision diagrams in the very same way. Fig. 4
shows the result of the most frequent class abstraction for our running example.
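
The following sketch illustrates the majority vote on class vectors and why it is not compositional. The class order, the tie-breaking rule (the first class with the maximal count wins) and the example vectors are assumptions made for illustration.

```python
CLASSES = ("Iris-setosa", "Iris-versicolor", "Iris-virginica")  # assumed order

def majority(v):
    """Majority vote on a class vector; ties go to the first maximal class (an assumption)."""
    return CLASSES[max(range(len(v)), key=lambda i: v[i])]

def add(v, w):
    return tuple(a + b for a, b in zip(v, w))

# Two pairs of hypothetical forests with identical individual majority votes ...
a1, b1 = (3, 2, 0), (0, 2, 3)
a2, b2 = (3, 0, 0), (0, 0, 4)
print(majority(a1), majority(b1))  # Iris-setosa Iris-virginica
print(majority(a2), majority(b2))  # Iris-setosa Iris-virginica

# ... whose combined forests nevertheless vote differently.
print(majority(add(a1, b1)))  # Iris-versicolor
print(majority(add(a2, b2)))  # Iris-virginica
```

The class vectors carry enough information to combine forests, whereas the majority labels alone do not.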


1 For the cases considered here it is polynomial, but there are of course theories for 
which it becomes exponentially hard or even undecidable. 
Fig. 4. Most frequent label abstraction of the aggregated Random Forest (majority
vote) without semantically redundant nodes (18 nodes).


6 Forest GUMP and Three Problems of Explainability


Forest GUMP² (Generalized, Unifying Merge Process) is a tool we developed to
illustrate the power of algebraic aggregation for the optimization and explanation
of Random Forests. It is designed to allow everyone, in particular people without
IT or machine learning knowledge, to experience the nature of Random Forests.
To avoid unnecessary entry hurdles, we decided to implement Forest GUMP as
a simple-to-use web application. It allows the user to experience the methods
described in the previous sections and the proposed solutions to the explainability
problems, which will be illustrated in the following sections. We will first give a
brief overview of Forest GUMP and then showcase its potential in the following
sections.

Forest GUMP’s user interface (see Figure 5) is essentially divided into two
parts. On the left side the user can input the data necessary to learn a Random
Forest and subsequently visualize it, while the currently chosen representation
is visualized on the right side. First, the user has to upload a dataset or
choose one of six datasets that we provide (cf. (1) in Fig. 5) on which the Random
Forest will be learned. Next, the hyperparameters necessary for the learning
procedure have to be selected, such as the number of trees to be learned (cf.
(2) in Fig. 5). Then, one can choose different aggregation methods, i.e. the ones
mentioned in the previous sections and further ones which will be explained in
the following sections (cf. (3) in Fig. 5). It is also possible to input a sample,
classify it with the ADD and highlight the path from the root to the leaf (satisfied
predicates are highlighted in green, unsatisfied predicates are highlighted in red).
Finally, the currently visualized ADD can be exported, as Forest GUMP
provides code generators for Java, C++, Python and GraphViz’s dot format (cf.
(4) in Fig. 5). Additionally, the currently visualized ADD can be exported as an
SVG to be viewed locally (cf. (4) in Fig. 5).


2 A link to a running instance of Forest GUMP is available at
https://gitlab.com/scce/forest-gump.

Fig. 5. Overview of Forest GUMP. The visualized ADD is our solution to the
class characterization problem (cf. Sect. 6.3) for the class Iris-Setosa.

Fig. 6. The execution history in Forest GUMP.

The grey rectangle (cf. (6) in Fig. 5) points to the root of the currently
visualized ADD. One can zoom in and out, which can be helpful when the ADDs
are rather large (cf. (6) in Fig. 5). On the top left the number of nodes and
the length of the currently highlighted path are displayed (cf. (7) in Fig. 5). On
the bottom right, one can open a history of all the representations one chose to
visualize (cf. (8) in Fig. 5).

Figure 6 shows the expanded execution history. For each visualized ADD, the
execution history lists the aggregation variant, the hyperparameters used to learn
the Random Forest, the size (i.e. the number of nodes) and the maximum
depth, i.e. the length of the longest path from root to leaf. The execution history also
allows one to replay an experiment by clicking on the button on the right side of
a row, which makes it possible to compare different ADD variants. One can also delete
individual entries or the whole history and export the history to a CSV file.

Fig. 7. The user can either choose to upload their own dataset or select one of six
exemplary datasets.


6.1 A Walkthrough of Forest GUMP 


In the following we will see how hard it is to understand how a Random Forest 
comes to its decision and provide methods for solving the three explainability 
problems with absolute precision. 


Learning a Random Forest. To begin, we need a Random Forest, which requires
a dataset on which it will be learned. In Forest GUMP, the user can
upload their own dataset in the Attribute-Relation File Format (ARFF) [28].
Alternatively, we provide six exemplary datasets from which a user can select
one to directly start using the tool. Figure 7 illustrates what this looks like in
Forest GUMP. Once a dataset has been chosen, the hyperparameters necessary for
the learning procedure of the Random Forest have to be specified (see Figure 8).
The inputs are the following:


— the number of trees to be learned,
— the bagging size, i.e. the fraction of samples to be used to learn each tree, and
— a seed to be able to reproduce the setting.³


Additionally, the user can decide to eliminate the infeasible paths as this can
strongly reduce the size of the ADDs (see Section 4). While the predicate order is
fixed by default, the user can decide to let Forest GUMP optimize the predicate
order as the order can also greatly impact the size of the ADDs. A more in-depth
discussion of the interplay between infeasible path elimination and
the predicate order will follow. Figure 9 shows a Random Forest that was learned
on the Iris dataset, consisting of 20 trees⁴, with a bagging size of 100% and 58 as the
seed. To classify a given input, we have to traverse each tree from the root
to a leaf, obtaining one predicted class per tree. The
class that was predicted most often is the final result. Trying to understand
why the Random Forest predicted this specific class is seemingly impossible. In
the following we will show how we can do better.


3 One can generate a random seed by clicking on the button next to the input field.

Fig. 8. The user has to specify the necessary hyperparameters to be able to learn
a Random Forest. While the first three hyperparameters are needed for the learning
procedure, the elimination of the infeasible paths and the optimization of the predicate
order are specific to our aggregation method.


6.2 Model Explanation Problem 


The canonical white-box model corresponding to the Random Forest of Figure 9 
can be constructed through the most frequent label abstraction (see Sect. 5) of 
the aggregated Random Forest (see Sect. 3), whose infeasible paths are elimi- 
nated (see Sect. 4). This solves the Model Explanation Problem. 

Figure 10 sketches the result of this construction: A canonical white-box
model with 310 nodes. Admittedly, this model is still frightening, but given a
sample, it allows one to easily follow the corresponding classification process, and
in this case it may require at most 19 individual decisions based on the petal
and sepal characteristics. This decision set is our set of predicates. The conjunction
of these predicates is a solution to the Outcome Explanation Problem.
However, more concise explanations are derived from the class characterization
BDD discussed in the following section.


4 Note that each decision tree is represented as an ADD.

Fig. 9. A Random Forest consisting of 20 individual decision trees (191
nodes, longest path consists of 9 nodes). Note that each decision tree is represented as
an ADD and that all ADDs share common subfunctions, i.e. it is essentially a shared
ADD forest. The actual Random Forest, where nothing is shared, contains 284 nodes.

Fig. 10. An extract of the model explanation. The ADD is constructed from the most
frequent label abstraction of the aggregated Random Forest following an elimination
of all infeasible paths (310 nodes, longest path with length 19, the highlighted path
has a length of 9).

Given the sample petallength = 2.4, petalwidth = 1.8, sepallength = 5.9,
sepalwidth = 2.5, the outcome explanation given by the model explanation consists
of the following 9 predicates (in Figure 10 satisfied predicates are highlighted
in green, unsatisfied predicates are highlighted in red):


¬(petalwidth < 0.75) ∧ ¬(petalwidth < 1.7) ∧ (petallength < 4.95) ∧
(sepalwidth < 2.65) ∧ (petallength < 4.85) ∧ (sepallength < 5.95) ∧
¬(petalwidth < 1.75) ∧ (petallength < 2.6) ∧ (petallength < 2.45)

Fig. 11. The class characterization for the class Iris-Setosa (10 nodes, the highlighted
path is also the longest path with length 5). The leaf corresponding to Iris-Setosa
is highlighted in green, the leaf representing all other classes (i.e. Iris-Virginica and
Iris-Versicolor) is highlighted in red.


While this is already an improvement compared to the Random Forest, where 
you would have to traverse all 20 decision trees, we will see how we can improve 
even more in the following. 


6.3 Class Characterization Problem 


The class characterization problem is particularly interesting because it allows
one to ‘reverse’ the classification process. While the direct problem is ‘given
a sample, provide its classification’, the reverse problem reads ‘given a class,
what are the characteristics of all the samples belonging to this class?’

BDD-based Class Characterisation can be defined via the following simple
transformation function: Given a class $c \in C$, we define a corresponding projection
function $\delta_{\mathbb{B}}(c) : C \to \mathbb{B}$ on the co-domain as


$$\delta_{\mathbb{B}}(c)(c') := \begin{cases} 1 & \text{if } c' = c \\ 0 & \text{otherwise} \end{cases}$$


for $c' \in C$. Again, the function $\delta_{\mathbb{B}}(c)$ can be lifted to operate on ADDs, yielding
$\Delta_{\mathbb{B}}(c) : D_C \to D_{\mathbb{B}}$.
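
A minimal sketch of this projection and its lifting is given below. It works on an unshared tree representation rather than on the ADD library used by the tool; the `Node` class, the helper names `lift` and `characterize` and the small example diagram are assumptions of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    pred: tuple    # (feature, threshold) stands for the predicate feature < threshold
    then: object   # sub-diagram taken when the predicate holds
    other: object  # sub-diagram taken when it does not

def lift(f, d):
    """Apply f to every leaf and drop tests whose branches have become equal."""
    if not isinstance(d, Node):
        return f(d)
    then, other = lift(f, d.then), lift(f, d.other)
    return then if then == other else Node(d.pred, then, other)

def characterize(d, c):
    """Lifted projection: leaves become 1 for class c and 0 otherwise."""
    return lift(lambda label: 1 if label == c else 0, d)

# Hypothetical diagram with class labels in its leaves.
d = Node(("petallength", 2.45), "Iris-setosa",
         Node(("petalwidth", 1.75), "Iris-versicolor", "Iris-virginica"))
print(characterize(d, "Iris-setosa"))
```

In this example the test on petalwidth disappears, since both of its leaves are mapped to 0. This collapsing is what makes class characterizations smaller than the model explanation.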

The BDD shown in Figure 11 is a minimal characterization of the set of all
the samples that are guaranteed to be classified as Iris-Setosa.

Fig. 12. The outcome explanation for the input petallength = 2.4, petalwidth = 1.8,
sepallength = 5.9, sepalwidth = 2.5 (10 nodes, highlighted path of length 5).


Being able to reverse a learned classification function is of major practical
importance. Think, e.g., of a marketing research scenario where data have
been collected with the aim of proposing best-fitting product offers to customers
according to their user profile. This scenario can be considered as a classification
problem where the offered product plays the role of the class. Now, being able to
reverse the customer → product classification function provides the marketing
team with a tailored product → customer promotion process: for a given product,
it addresses all customers considered to favor this very product, as in the
corresponding patent [18].

The path highlighted in Figure 11 is the path from the root to the leaf
for the same sample petallength = 2.4, petalwidth = 1.8, sepallength = 5.9,
sepalwidth = 2.5. Compared to the path with length 9 in the model explanation,
we now have a path of length 5 with the following predicates:


¬(petalwidth < 0.75) ∧ (petallength < 4.95) ∧ (petallength < 4.85) ∧
(petallength < 2.6) ∧ (petallength < 2.45)


6.4 Outcome Explanation Problem


The previous classification formula expresses the collection of ‘conditions’ that
this sample satisfies, and it therefore provides a precise justification of why it is
classified in this class. Despite the fact that the class characterization BDD is
canonical, it is easy to see that there are some redundancies in the formula. For
example, a petal length smaller than 2.45 is necessarily also smaller than 2.6, 4.85 and 4.95;
therefore, for this specific sample those three predicates are redundant. This is
the result of the imposed predicate ordering in BDDs: all the BDD predicates are
listed, and they are listed in a fixed order. After eliminating these redundancies,
we are left with the following precise minimal outcome explanation: this sample
is recognized as belonging to the class Iris-Setosa because it has the properties
¬(petalwidth < 0.75) ∧ (petallength < 2.45).

In Forest GUMP we make these redundant predicates explicit by highlighting 
them in blue (see Figure 12). From 9 predicates in the model explanation to 5 
predicates in the class characterization, we have now arrived at an explanation 
that only consists of 2 predicates. 
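
The suppression of redundant predicates on a path can be sketched as follows for threshold predicates. The encoding of a path as (feature, threshold, holds) triples and the function name `minimal_explanation` are assumptions of this sketch; it keeps, per feature, only the tightest upper and the tightest lower bound.

```python
def minimal_explanation(path):
    """path: list of (feature, threshold, holds); holds=True means feature < threshold."""
    upper, lower = {}, {}
    for feature, threshold, holds in path:
        if holds:   # feature < threshold: the smallest threshold implies the others
            upper[feature] = min(threshold, upper.get(feature, threshold))
        else:       # not (feature < threshold): the largest threshold implies the others
            lower[feature] = max(threshold, lower.get(feature, threshold))
    return [p for p in path if p[1] == (upper if p[2] else lower)[p[0]]]

# The five predicates on the highlighted path of the class characterization.
path = [
    ("petalwidth", 0.75, False),
    ("petallength", 4.95, True),
    ("petallength", 4.85, True),
    ("petallength", 2.6, True),
    ("petallength", 2.45, True),
]
print(minimal_explanation(path))
# [('petalwidth', 0.75, False), ('petallength', 2.45, True)]
```

The two remaining triples correspond to ¬(petalwidth < 0.75) ∧ (petallength < 2.45), the minimal outcome explanation stated above.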


7 Lessons Learned 


Playing with Forest GUMP led to interesting observations not only concerning 
the analyzed data domains but also concerning Random Forest Learning and 
the applied ADD technology. 


Random Forest Learning. Changing the random seed for the learning process
had a significant impact on the size of the explanation models and the class
characterizations. The observed sizes of the explanation models ranged from 138
to 519. Interestingly, larger sizes did not necessarily imply a better
prediction quality. The same also applied to the class characterizations. In fact,
we observed a 100% prediction quality for a class characterization of only 3
nodes, while a class characterization for the same species with 40 nodes
scored only 33% prediction quality.


Analyzed Data Domain. The class characterizations for the three iris species
differed quite a bit. For two species the observed sizes were much bigger than the
sizes for the third species, independently of the chosen random seed and bagging
size. In fact, for Iris-Setosa we observed a class characterization with only 3
nodes, implying an outcome explanation for our chosen sample with only one
predicate. Figure 13 provides the corresponding explanation. Put differently,
class characterizations seem to be good indications of ‘tightness’: The closer the
samples lie, the more criteria are required for separation.


ADD Technology. ADDs are canonical as soon as one has chosen a predicate/variable
ordering. Although we could observe the effect of corresponding
optimization heuristics⁵, the impact was moderate and helpful mainly for model
explanation and class characterization. Figure 14 shows the outcome explanation
for the same problem but where the ADD, representing the class
characterization for the class Iris-Setosa, is reordered⁶. While the reordering
reduces the class characterization size from 10 to 8 nodes, the length of the
outcome explanation is unchanged. For the model explanation of Figure 10, the
size can be reduced from 310 nodes to 196 nodes while the path for the sample
petallength = 2.4, petalwidth = 1.8, sepallength = 5.9, sepalwidth = 2.5 actually
increased by 1 (from 9 to 10). Thus the outcome explanation may even be
impaired. This is not too surprising as these optimizations aim at size reduction
and not depth reduction of the considered ADDs. We are currently investigating
good heuristics for depth reduction.

More striking was the impact of infeasible path elimination. In fact, this optimization
can be regarded as key for scalability when increasing the forest size. [14]
reports results about forests with 10,000 trees. Without infeasible path reduction,
already 100 trees are problematic.


5 CUDD [25] provides a number of heuristics for optimizing variable orders.
6 The reordering method used is CUDD_REORDER_GROUP_SIFT_CONV, as
it was both fast and effective in our experiments.

Fig. 13. Visualization of the iris dataset using only the petal length and petal width.

Standard ADD frameworks work on Boolean variables rather than predicates. 
Thus in their setting infeasible paths do not occur. The problem of infeasible 
path reduction in ADDs was first discussed in [13,14]. Our current corresponding 
solution is still basic. We are currently generalizing our solution using more 
involved SMT technology. 


Of course, these observations were made on rather small datasets and it remains
to be seen how well they transfer to more complex scenarios. We believe, however,
that they indicate general phenomena whose essence remains true in larger
settings.


8 Conclusion and Perspectives 


We have presented Forest GUMP (for Generalized, Unifying Merge Process), a
tool for providing tangible experience with three concepts of explanation: model
explanation, outcome explanation, and class characterization. The key technology to
achieve model explanation is algebraic aggregation, i.e. the transformation of a
Random Forest into a semantically equivalent, concise white-box representation
in terms of Algebraic Decision Diagrams. Class characterization is then achieved
in terms of BDDs where the structure unnecessary to distinguish the considered
class is collapsed. This abstraction is not only interesting in itself to better
understand how easily the classes can be separated, but it also leads to highly
optimized outcome explanations. Together with infeasible path elimination and
the suppression of redundant predicates on a path, we observe reductions of
outcome explanations by more than an order of magnitude. Forest GUMP allows
even newcomers to easily experience these phenomena without much training.

Fig. 14. The outcome explanation for the input petallength = 2.4, petalwidth = 1.8,
sepallength = 5.9, sepalwidth = 2.5 (8 nodes, highlighted path of length 5) where the
class characterization from Figure 11 is reordered.

Of course, these are first steps in a very ambitious new direction and it has 
to be seen how far the approach carries. Scalability will probably require decom- 
position methods, perhaps in a similar fashion as illustrated by the difference 
between model explanation and the considerably smaller class characterization. 
More work is needed also on techniques that aim at limiting the number of 
involved predicates. 

Data Availability Statement: The artifact is available in the Zenodo 
repository [22]. 
Abstract. GPU programs are widely used in industry. To obtain the
best performance, a typical development process involves the manual or
semi-automatic application of optimizations prior to compiling the code.
To avoid the introduction of errors, we can augment GPU programs
with (pre- and postcondition-style) annotations to capture functional
properties. However, keeping these annotations correct when optimizing
GPU programs is labor-intensive and error-prone.

This paper introduces ALPINIST, an annotation-aware GPU program op- 
timizer. It applies frequently-used GPU optimizations, but besides trans- 
forming code, it also transforms the annotations. We evaluate ALPINIST, 
in combination with the VerCors program verifier, to automatically op- 
timize a collection of verified programs and reverify them. 


Keywords: GPU · Optimization · Deductive verification · Annotation-aware ·
Program transformation


1 Introduction 


Over the course of roughly a decade, graphics processing units (GPUs) have 
been pushing the computational limits in fields as diverse as computational biol- 
ogy [64], statistics [35], physics [7], astronomy [24], deep learning [29], and formal 
methods [17,43,44,65,67]. Dedicated programming languages such as CUDA [34] 
and OpenCL [42] can be used to write GPU source code. To get the best
performance out of GPUs, developers should apply incremental optimizations,
tailored to the GPU architecture. Unfortunately, this is to a large extent a
manual activity. The fact that for different GPU devices, the same code tends to
require a different sequence of transformations [21] makes this procedure even 
more time consuming and error-prone. Recently, automating this has received 
some attention, for instance by applying machine learning [3]. 
NWO TTW grant 17249 for the ChEOPS project


© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 332-352, 2022.
https://doi.org/10.1007/978-3-030-99527-0_18


ALPINIST: an Annotation-Aware GPU Program Optimizer


[Diagram with the elements: User-Selected Transformations, Annotated Program,
Annotation-Aware Program Transformer, Transformed Annotated Program,
Deductive Program Verifier.]

Fig. 1: Annotation-Aware Program Transformation.


Reasoning about the correctness of GPU software is hard, but necessary. Mul- 
tiple verification techniques and tools have been developed to aid in this task 
aimed at detecting data races, see [8, 10, 14, 32, 33], and for a recent overview, 
see [22]. Some of these techniques apply deductive program verification, which 
requires a program to be manually augmented with pre- and postcondition an- 
notations. However, annotating a program is time consuming. The more complex 
a program is, the more challenging it becomes to annotate it. In particular, as a 
program is being optimized repeatedly, its annotations tend to change frequently. 

This paper presents ALPINIST, a tool that can apply annotation-aware trans- 
formations [26] on annotated GPU programs. It can be used with the deductive 
program verifier VerCors [9]. VerCors can verify the functional correctness of 
GPU programs [10]. It allows the verification of many typical GPU computa- 
tions, see e.g., [48,50,51]. The purpose of ALPINIST is twofold (see Fig. 1): First, it 
automates the optimization of GPU code, to the extent that the developer needs 
to indicate which optimization needs to be applied where, and the tool performs 
the transformation. Interestingly, the presence of annotations is exploited by 
ALPINIST to determine whether an optimization is actually applicable, and in 
doing so, can sometimes apply an optimization where a compiler cannot. Second, 
as it applies a code transformation, it also transforms the related annotations, 
which means that once the developer has annotated the unoptimized, simpler 
code, any further optimized version of that code is automatically annotated with 
updated pre- and postconditions, making it reverifiable. This avoids having to 
re-annotate the program every time it is optimized for a specific GPU device. 

ALPINIST supports GPU code optimizations that are used frequently in prac- 
tice, namely loop unrolling, tiling, kernel fusion, iteration merging, matrix lin- 
earization and data prefetching. In the current paper, we discuss how ALPINIST 
has been implemented, how it can be applied on annotated GPU code, and how 
some of the more complex optimizations work. In addition, we evaluate the ef- 
fect of applying several of these optimizations, both in terms of annotation size 
and time needed to verify a program, to a collection of examples including the 
verified case studies in [48, 49,51]. 


Outline. Section 2 demonstrates how ALPINIST optimizes a verified GPU program
while preserving its provability. Section 3 discusses the architecture of
ALPINIST. Section 4 discusses the most complex optimizations supported by
ALPINIST in detail, namely loop unrolling, tiling and kernel fusion, and briefly
discusses the remaining three. Section 5 presents the results of experiments in
which the tool has been applied on a collection of programs. Section 6 discusses
related work and Section 7 concludes the paper and discusses future work.


/*@ context_everywhere N > 0 && N < a.length;
    req (\forall* int i; 0 <= i < a.length; Perm(a[i], 1));
    ens (\forall* int i; 0 <= i < a.length; i != a.length-1 ==> Perm(a[i+1], 1));
    ens (\forall* int i; 0 <= i < a.length; i == a.length-1 ==> Perm(a[0], 1));
    ens (\forall int i; 0 <= i < a.length-1; a[i+1] == N*i);
    ens a[0] == N*(a.length-1); @*/
void Host(int[] a, int size, int N) {
  par Kernel1 (int tid = 0 .. a.length)
  /*@ context Perm(a[tid], 1);
      ens a[tid] == 0; @*/
  { a[tid] = 0; }
  par Kernel2 (int tid = 0 .. a.length)
  /*@ context tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
      req tid != a.length-1 ? a[tid+1] == 0 : a[0] == 0;
      ens tid != a.length-1 ? a[tid+1] == N*tid : a[0] == N*tid; @*/
  { /*@ inv k >= 0 && k <= N;
        inv tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
        inv tid != a.length-1 ? a[tid+1] == k*tid : a[0] == k*tid; @*/
    for(int k = 0; k < N; k++) {
      if (tid != a.length-1) { a[tid+1] = a[tid+1] + tid; }
      else { a[0] = a[0] + tid; }
  } } }

Fig. 2: A verified GPU-style program


2 Annotation-Aware Optimization using ALPINIST 


This section illustrates how ALPINIST can optimize a verified GPU program while
preserving its provability. Fig. 2 shows a GPU program with annotations [10] that
is verified by VerCors. The example is written in a simplified version of VerCors’
own language PVL. The program initializes an array a, and subsequently updates
the values in a, N times. The workflow of a GPU program in general is that the
host (i.e., CPU) invokes a kernel, i.e., a GPU function, executed by a specified
number of GPU threads. These threads are organized in one or more thread
blocks. In this program, there are two kernels, both executed by one thread
block of a.length threads (lines 8 and 12 (l.8, l.12))¹. Each thread has a unique
identifier, in the example called tid. In the first kernel (l.8-l.11), each thread
initializes a[tid] to 0. In the second kernel (l.12-l.22), each thread updates
a[tid+1] (modulo a.length) N times, by adding tid to it. In the main Host
function, Kernel1 is called, followed by Kernel2.
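
To make the intended behaviour concrete, the program of Fig. 2 can be simulated sequentially. The following Python sketch is only an illustration and not part of ALPINIST or VerCors; it runs the threads of each kernel one after the other, which is harmless here because every thread writes to a different array element. The function name `host` and the input values are assumptions.

```python
def host(a, N):
    n = len(a)
    # Mirrors the annotation: context_everywhere N > 0 && N < a.length
    assert 0 < N < n
    for tid in range(n):                 # Kernel1: each thread clears its own element
        a[tid] = 0
    for tid in range(n):                 # Kernel2: each thread updates a[tid+1] modulo n
        target = tid + 1 if tid != n - 1 else 0
        for _ in range(N):
            a[target] = a[target] + tid
    return a

N = 3
a = host([7, 7, 7, 7, 7], N)
n = len(a)
# The two functional postconditions of Host.
assert all(a[i + 1] == N * i for i in range(n - 1))
assert a[0] == N * (n - 1)
print(a)  # [12, 0, 3, 6, 9]
```

Such a simulation only tests one input, whereas the annotations allow VerCors to prove the postconditions for all inputs that satisfy the precondition.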

'The kernels, the for-loop and the host function are annotated for verification 
(in blue), using permission-based separation logic [6, 11,12]. Permissions capture 
which memory locations may be accessed by which threads; they are fractional 
values in the interval (0, 1] (cf. Boyland [12]): any fraction in the interval (0, 


3 In practice, the size of a block cannot exceed a specific upper-bound, but for this
example, we assume that a.length is sufficiently small.

 1 /*@ context_everywhere N > 0 && N < a.length;
 2     req (\forall* int i; 0 <= i < a.length; Perm(a[i], 1));
 3     ens (\forall* int i; 0 <= i < a.length; i != a.length-1 ==> Perm(a[i+1], 1));
 4     ens (\forall* int i; 0 <= i < a.length; i == a.length-1 ==> Perm(a[0], 1));
 5     ens (\forall int i; 0 <= i < a.length-1; a[i+1] == N*i);
 6     ens a[0] == N*(a.length-1); @*/
 7 void Host(int[] a, int size, int N){
 8   par Fused_Kernel(int tid = 0 .. a.length)
 9   /*@
10     [content not recoverable from the extracted figure]
11   @*/
12-16 [content not recoverable from the extracted figure]
17   @*/
18   barrier(Fused_Kernel)
19   [content not recoverable from the extracted figure]
20   int a_reg_0, a_reg_1;
21   if (tid != a.length-1) { a_reg_1 = a[tid+1] } else { a_reg_0 = a[0] }
22   int k = 0;
23   if (tid != a.length-1) { a_reg_1 = a_reg_1 + tid; }
24   else { a_reg_0 = a_reg_0 + tid; }
25   k++;
26   /*@
27     inv tid != a.length-1 ? Perm(a[tid+1], 1) : Perm(a[0], 1);
28     inv tid != a.length-1 ? a_reg_1 == k*tid : a_reg_0 == k*tid; @*/
29   for(k; k < N; k++) {
30     if (tid != a.length-1) { a_reg_1 = a_reg_1 + tid; }
31     else { a_reg_0 = a_reg_0 + tid; }
32   }
33   if (tid != a.length-1) { a[tid+1] = a_reg_1 } else { a[0] = a_reg_0 };
34 } }


Fig. 3: An optimized GPU-style program, annotated for verification


1) indicates a read permission, while 1 indicates a write permission. A write 
permission can be split into multiple read permissions and read permissions can 
be added up, and transformed into a write permission if they add up to 1. The 
soundness of the logic ensures that for each memory location, the total number 
of permissions among all threads does not exceed 1. 
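
The splitting and merging rules can be illustrated with exact rational arithmetic. The following Python sketch is not part of VerCors or ALPINIST: the class LocationPermissions and its method names are hypothetical and only model the bookkeeping for a single memory location, under the assumption that permissions are moved between threads explicitly.

```python
from fractions import Fraction


class LocationPermissions:
    """Permission bookkeeping for ONE heap location (hypothetical helper)."""

    def __init__(self):
        self.held = {}  # thread id -> Fraction in (0, 1]

    def total(self):
        return sum(self.held.values(), Fraction(0))

    def grant(self, tid, amount):
        amount = Fraction(amount)
        if amount <= 0:
            raise ValueError("a permission must be a positive fraction")
        # Soundness condition: the permissions for one location never exceed 1.
        if self.total() + amount > 1:
            raise ValueError("permissions for one location may not exceed 1")
        self.held[tid] = self.held.get(tid, Fraction(0)) + amount

    def transfer(self, src, dst, amount):
        amount = Fraction(amount)
        if amount <= 0 or self.held.get(src, Fraction(0)) < amount:
            raise ValueError("source thread does not hold that much permission")
        self.held[src] -= amount
        if self.held[src] == 0:
            del self.held[src]
        self.held[dst] = self.held.get(dst, Fraction(0)) + amount

    def can_read(self, tid):
        # Any fraction in (0, 1) is a read permission; 1 also allows reading.
        return self.held.get(tid, Fraction(0)) > 0

    def can_write(self, tid):
        # Only the full permission 1 allows writing.
        return self.held.get(tid, Fraction(0)) == 1


loc = LocationPermissions()
loc.grant(0, 1)                      # thread 0 holds the write permission
loc.transfer(0, 1, Fraction(1, 3))   # split: thread 1 receives a read permission
split_state = (loc.can_write(0), loc.can_read(0), loc.can_read(1))
loc.transfer(1, 0, Fraction(1, 3))   # merge: the fractions add up to 1 again
merged_state = (loc.can_write(0), loc.can_read(1))
```

After the split, thread 0 holds 2\3 and can only read; after the merge it holds 1 again and can write, while thread 1 holds nothing.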

To specify permissions, predicates are used of the form Perm(L, π) where L
is a heap location and π a fractional value in the interval (0, 1] (e.g., 1\3). Pre-
and postconditions, denoted by keywords req and ens, should hold at the
beginning and the end of an annotated function, respectively. The keyword context
abbreviates both req and ens (l.9, l.13). The keyword context_everywhere is
used to specify a property that must hold throughout the function (l.1). Note
that \forall* is used to express a universal separating conjunction over
permission predicates (l.2-l.4) and \forall is used as standard universal conjunction
over logical predicates (l.5). For logical conjunction, && is used and ** is used as
separating conjunction in separation logic.

In the example, write permissions are required for all locations in a (l.2).
The pre- and postconditions of the first kernel specify that each thread needs
write permission for a[tid] (l.9). The postcondition states that a[tid] is set
to 0 (l.10). In the second kernel, all threads have write permission for a[tid+1],
except thread a.length-1 which has write permission for a[0] (l.13). Moreover,
it is required that a[tid+1] (modulo a.length) is 0 (l.14). For the for-loop
(l.19-l.22), loop invariants are specified: k is in the range [0, N] (l.16), each thread has
write permission for a[tid+1] (modulo a.length) (l.17) and this location always
has the value k*tid (l.18). The postconditions of the second kernel and the host
function are similar to this latter invariant.

Fig. 3 shows an optimized version of the program, with updated annotations 
to make it verifiable. ALPINIST has applied three optimizations: 


1. Fusing the two kernels: in GPU programs, the only global synchronisation
points (used, for instance, to avoid data races) exist implicitly between
kernel launches. However, if such a global synchronisation point is not really
needed between two specific kernels, then fusing them gives several benefits,
in particular the ability to store intermediate results in (fast) thread-local
register memory as opposed to (slow) GPU global memory, and it has a
positive effect on power consumption [62]. In the example, the kernels are
combined into Fused_Kernel, and a thread block-local barrier is introduced
(l.18) to avoid data races within the single thread block executing the code.

2. Using register memory: register variables can be used to reduce the number
of global memory accesses. Here, the use of a_reg_0 and a_reg_1 has been
enabled by kernel fusion.

3. Unrolling the for-loop: the for-loop has been unrolled once here (l.20-l.25).
Since GPU threads are very light-weight, compared to CPU threads, any
checking of conditions that can be avoided benefits performance. When
unrolling a loop, this means that fewer checks of the loop-condition are needed.
Note that here, ALPINIST benefits from the knowledge that N > 0 (l.1), so it
knows that the for-loop can be unrolled at least once.


To preserve provability of the optimized program, ALPINIST changed the
annotations, in particular the pre- and postcondition of the fused kernel and
the loop invariants (highlighted in Fig. 3). Moreover, ALPINIST introduced an
annotated barrier (l.14-l.18). Since threads synchronize at a barrier, it is possible
to redistribute the permissions. In the rest of the paper, we discuss how ALPINIST
performs these annotation-aware transformations.
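
To make the effect of the three optimizations concrete, the following Python sketch simulates the program described for Fig. 2 and the optimized program of Fig. 3 sequentially, thread by thread. It is only an illustration, not PVL and not ALPINIST output: the function names are hypothetical, the two branches on tid are folded into a modulo operation, and the barrier is modelled by finishing the first phase for all threads before the second phase starts.

```python
def original_program(length, n):
    """Two kernels, as described for Fig. 2."""
    a = [None] * length
    for tid in range(length):              # Kernel1: initialize a[tid]
        a[tid] = 0
    for tid in range(length):              # Kernel2: update a[tid+1] (modulo length)
        target = (tid + 1) % length
        for _ in range(n):
            a[target] = a[target] + tid
    return a


def optimized_program(length, n):
    """Fused kernel with a register variable and a loop unrolled once (Fig. 3)."""
    if n <= 0:
        raise ValueError("unrolling once relies on N > 0")
    a = [None] * length
    for tid in range(length):              # code before the barrier
        a[tid] = 0
    # Barrier: all threads have finished the initialization at this point.
    for tid in range(length):              # code after the barrier
        target = (tid + 1) % length
        reg = a[target]                    # plays the role of a_reg_0 / a_reg_1
        reg = reg + tid                    # unrolled first iteration
        k = 1
        while k < n:                       # remaining N-1 iterations
            reg = reg + tid
            k += 1
        a[target] = reg                    # single write back to global memory
    return a


result = optimized_program(4, 3)
```

For length 4 and N = 3, both versions produce an array in which a[(tid+1) mod 4] equals 3*tid, which corresponds to the postconditions on l.5 and l.6 of Fig. 3.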


3 The Design of ALPINIST 


This section gives a high-level overview of the design of ALPINIST. The opti- 
mizations supported by ALPINIST are discussed in Section 4. To understand the 
design of ALPINIST, we first explain the architecture of the VerCors verifier. 


3.1 VerCors’ Architecture 


VerCors is a deductive program verifier, which is designed to work for different
input languages (e.g., Java and OpenCL). It takes as input an annotated program,
which is then transformed in several steps into an annotated Silver program.
Silver is an intermediate verification language, used as input for Viper [37, 60].
Viper then generates proof obligations, which can be discharged by an
automated theorem prover, such as Z3 [36].

The internal transformations in VerCors are defined over our internal AST 
representation (written in the Common Object Language or COL [52]), which 
captures the features of all input languages. Some of the transformations are 
generic (e.g., splitting composite variable declarations) and others are specific 
to verification (e.g., transforming contracts). The transformations implemented 
as part of ALPINIST are also applied on the COL AST, but they are developed
with a different goal in mind, and in particular several of the transformations are
specific to the supported optimizations.

Using VerCors and its architecture to implement ALPINIST has three benefits.
First, existing helper functions can be reused, which simplifies tasks such
as gathering information regarding specific AST nodes. Second, some generic 
transformations of VerCors can be reused, such as splitting composite variable 
declarations or simplifying expressions. This helps to simplify the implementa- 
tion of the optimizations. Third, using the architecture of VerCors allows us to 
prove assertions that we generate relatively easily by invoking VerCors internally. 


3.2  ALPINIST's Architecture 


ALPINIST takes a verified file as its input, annotated with special optimiza- 
tion annotations that indicate where specific optimizations should be applied. 
ALPINIST is written in Java and Scala and runs on Windows, Linux and macOS. 
Fig. 4 gives a high-level overview of the internal design of ALPINIST. The input 
program goes through four phases: the parsing phase, the applicability checking 
phase, the transformation phase and the output phase. 

The parsing phase transforms the input file into a COL AST, after which 
the applicability checking phase checks if the optimization can be applied. Some 
optimizations, such as tiling (see Section 4.2), are always applicable, hence their 
applicability check always passes. For other optimizations, prerequisites must be 
established. Sometimes, a syntactical analysis of the AST suffices, e.g., for kernel
fusion (see Section 4.3). For this optimization, it must be determined whether
there is any data dependency between two selected kernels. When analysis of the
AST is not enough, VerCors can be used to perform more complex reasoning.
An example of this is loop unrolling (see Section 4.1). To unroll a loop k times,
it must be guaranteed that the loop executes at least k times. This prerequisite
is encoded as an assertion to be proven by VerCors.

The applicability checking phase is one of the strengths of ALPINIST. It
exploits the fact that the input program is annotated to determine whether an
optimization is applicable, and relies on the fact that VerCors can perform
complex reasoning. Moreover, this approach makes it possible to distinguish
failures due to unsatisfied prerequisites from failures due to mistakes in the
transformation procedure.

Fig. 4: The internal design of ALPINIST.


If the applicability check passes (i.e., the optimization is applicable), the
transformation phase is next; otherwise, a message is generated stating that the
prerequisites could not be proven.

The transformation phase applies the optimizations to the input AST. The
output phase either prints the optimized program in the same language as the
input program, or prints a message signifying either a failure in optimizing
or a verification failure in the applicability checking phase.
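
The control flow of the four phases can be summarized in a few lines of Python. This is a schematic sketch only: ALPINIST itself is written in Java and Scala, and every name below (Program, Optimization, optimize, unroll_by) is hypothetical. The "AST" is a plain dictionary, and the applicability check is a simple predicate standing in for either a syntactic analysis or a query to VerCors.

```python
from dataclasses import dataclass
from typing import Callable

Program = dict  # hypothetical stand-in for a COL AST


@dataclass
class Optimization:
    name: str
    is_applicable: Callable[[Program], bool]   # AST analysis or a VerCors query
    transform: Callable[[Program], Program]


def optimize(source: Program, opt: Optimization) -> str:
    program = dict(source)                      # parsing phase (trivial here)
    if not opt.is_applicable(program):          # applicability checking phase
        return f"{opt.name}: prerequisites could not be proven"
    try:
        result = opt.transform(program)         # transformation phase
    except Exception as error:                  # failure in the transformation itself
        return f"{opt.name}: optimization failed: {error}"
    return f"optimized program: {result}"       # output phase


def unroll_by(k: int) -> Optimization:
    # Toy prerequisite: the loop is known to execute at least k times.
    return Optimization(
        name=f"unroll x{k}",
        is_applicable=lambda p: p["min_iterations"] >= k,
        transform=lambda p: {**p, "unrolled": k},
    )


accepted = optimize({"min_iterations": 2}, unroll_by(2))
rejected = optimize({"min_iterations": 1}, unroll_by(2))
```

Keeping the applicability check separate from the transformation is what allows the two kinds of failure to be reported with different messages.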


4 GPU Optimizations 


ALPINIST supports six frequently-used GPU optimizations, namely loop
unrolling, tiling, kernel fusion, iteration merging, matrix linearization and data
prefetching. This section discusses loop unrolling, tiling, and kernel fusion in
detail. The other optimizations follow the same approach in spirit and are
discussed only briefly; their details can be found in the ALPINIST implementation [16].
Each optimization is introduced in the context of GPU programs. Then, we discuss
how to apply it. Interesting insights are discussed where relevant.


4.1 Loop Unrolling 


Loop unrolling is a frequently-used optimization technique that is applicable
to both GPU and CPU programs. It unrolls some iterations of a loop, which
increases the code size, but can have a positive impact on program performance;
e.g., see [21, 38, 46, 59, 63] for its impact, specifically on GPU programs. Fig. 5
shows an example of unrolling an (annotated) loop twice: the body of the loop is
duplicated twice before the loop. This has the following effect on the annotations:
the loop invariant bounding the loop variable (l.5) changes in the optimized
program (l.14). Note that the other loop invariants (i.e., Inv(i)) remain the
same. Moreover, after each unrolling part, we add all invariants as assertions
(l.8-l.10) except after the last unroll. This captures that the code produced by
unrolling the loop should still satisfy the original loop invariants.

Our approach to loop unrolling is more general than optimization techniques 
during compilation. For instance, the unroll pragma in CUDA [55] and the 
unroll function in Halide [56] unroll loops by calculating the number of iterations 
to see if unrolling is possible, i.e., it should be computable at compile time. 
This difference is illustrated in Fig. 5 where N (i.e., the number of iterations) 
is unknown at compile time. Their approach cannot automatically handle this

Original program:

 1 /*@ context_everywhere N > 1; @*/
 2 void Host(int[] arr, int size, int N){
 3   par Kernel(tid=0..size){
 4     int i = 0;
 5     /*@ inv i >= 0 && i <= N;
 6         inv N > 1;
 7         inv Inv(i); @*/
 8     loop (i < N){
 9       int newInt = i;
10       arr[tid] = arr[tid] + newInt;
11       i = i + 1; }
12 }}

Optimized program (loop unrolled twice):

 1 /*@ context_everywhere N > 1; @*/
 2 void Host(int[] arr, int size, int N){
 3   par Kernel(tid=0..size){
 4     int i = 0;
 5     int newInt = i;
 6     arr[tid] = arr[tid] + newInt;
 7     i = i + 1;
 8     //@ assert i >= 1 && i <= N;
 9     //@ assert N > 1;
10     //@ assert Inv(i);
11     newInt = i;
12     arr[tid] = arr[tid] + newInt;
13     i = i + 1;
14     /*@ inv i >= 2 && i <= N;
15         inv N > 1;
16         inv Inv(i); @*/
17     loop (i < N){
18       newInt = i;
19       arr[tid] = arr[tid] + newInt;
20       i = i + 1; }
21 }}

Fig. 5: An example of unrolling a loop 2 times.

 1 void Host(int[] array, int size){
 2   par Kernel(tid=0..size){
 3     int i = init;                    // The loop variable
 4     ...
 5     //@ assert (i == a) || (i == b); // Depending on initialization of i only one
 6                                      // of the conditions is specified
 7     /*@ inv i >= a && i <= b;        // The lowerbound of i (a), The upperbound of i (b)
 8         inv Inv(i); @*/              // Additional loop invariants
 9     loop (cond(i)) {                 // The loop condition
10       body(i);                       // The loop body, a sequence of statements in the i-th iteration.
11       i = upd(i); }                  // The update function of i, restricted to (i+c), (i-c),
12 } }                                  // (i×c) or (i/c) where c is a positive integer constant⁴.


Fig. 6: A general template of a loop inside a kernel.


case, while our approach can automatically unroll the loop, since annotations
(l.1, l.6) specify the lower-bound of N (provided by the programmer, who knows
that this is a valid lower-bound). VerCors verifies that the unrolling is valid.

Fig. 6 shows a loop template in a verified GPU program. We would like
to automatically unroll the loop k times and preserve the provability of the
program. To accomplish this, we follow a procedure consisting of three parts:
the main, checking and updating part. In the main part, an annotated (verified)
GPU program and a positive k are given as input. Next we go to the checking
part, to see if it is possible to unroll the loop k times. This part corresponds
to the applicability checking phase. Thus, we statically calculate the number
of loop iterations, by counting how many times the condition (cond(i)) holds
starting from either a (as the lowerbound of i) or b (as the upperbound of i),
depending on the operation of upd(i). If k is greater than the total number of
loop iterations at the end of the checking part, then we report an error. Otherwise


4 If c were negative, for the multiplication and division, i would oscillate between
positive and negative values and hence would not always be useful as array index.
Hence we consider c to be positive.

Fig. 7: Inter- and intra-tiling of an array with T = 12, N = 4 and ⌈T/N⌉ = 3.


void Host(int[] a, int T){
  par Kernel(tid = 0..T)
  /*@ // Preconditions related to permissions and functional correctness
      req prePerm(a[tid]) ** preFunc(a[tid]);
      // Postconditions related to permissions and functional correctness
      ens postPerm(a[tid]) ** postFunc(a[tid]); @*/
  { body(a[tid]); } }



Fig. 8: A general unoptimized GPU program to apply for tiling. 


we go to the updating part, in which we update either a or b according to the
operation in upd(i). If the operation is addition or multiplication, then the loop
variable i (in the unoptimized program) goes from a to b. That means, after
unrolling, a should be updated according to the constant c from the update
expression and k. If the operation is subtraction or division, i goes from b to a.
Thus, after unrolling, b should be updated. After the updating part, we return
to the main part to unroll the loop k times.
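
The checking and updating parts can be sketched in Python as follows. This is a simplification under stated assumptions, not the ALPINIST implementation: the bounds a and b and the loop condition are concrete values here, whereas ALPINIST reasons about annotated (possibly symbolic) bounds with the help of VerCors. The function names, the OPS table and the iteration limit are hypothetical, and division is taken to be integer division.

```python
import operator

# The four update operations allowed by the template of Fig. 6.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.floordiv}


def count_iterations(cond, op, c, a, b, limit=10_000):
    """Checking part: count how often cond(i) holds, starting from a or b."""
    upd = OPS[op]
    i = a if op in ("+", "*") else b   # i goes from a to b, or from b to a
    count = 0
    while cond(i) and count < limit:   # limit guards against non-termination
        i = upd(i, c)
        count += 1
    return count


def bounds_after_unrolling(cond, op, c, a, b, k):
    """Updating part: bounds of i for the loop that remains after k unrolls."""
    if k <= 0:
        raise ValueError("k must be positive")
    if k > count_iterations(cond, op, c, a, b):
        raise ValueError("the loop does not execute k times")
    upd = OPS[op]
    for _ in range(k):
        if op in ("+", "*"):
            a = upd(a, c)              # lower bound moves towards b
        else:
            b = upd(b, c)              # upper bound moves towards a
    return a, b


# The loop of Fig. 5 with N fixed to 5: i starts at 0 and is incremented by 1.
new_bounds = bounds_after_unrolling(lambda i: i < 5, "+", 1, 0, 5, 2)
```

With k = 2, the lower bound moves from 0 to 2, which matches the change of the invariant from i >= 0 to i >= 2 in Fig. 5.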


4.2 Tiling 


Tiling is another well-known optimization technique for GPU programs. It in- 
creases the workload of the threads to fully utilize GPU resources by assigning 
more data to each thread. Concretely, we assume there are T threads and a one- 
dimensional array of size T in the unoptimized GPU program where each thread 
is responsible for one location in that array (Fig. 8). To apply the optimization, 
we first divide the array into ⌈T/N⌉ chunks, each of size N (1 < N < T)⁵. There
are two different ways to create and assign threads to array cells (as in Fig. 7):
— Inter-Tiling: We define N threads and assign them to one specific location in
each chunk. That means each thread serially iterates over all chunks to be
responsible for a specific location in each chunk.
— Intra-Tiling: We define ⌈T/N⌉ threads and assign one thread to one chunk
(i.e., 1-to-1 mapping) to serially iterate over all cells in that chunk.

Both forms of tiling can have a positive impact on GPU program performance; 
e.g., see [25, 28, 47,69] for the impact of this optimization. 

Fig. 9 shows the optimized version of Fig. 8 by applying inter-tiling.
Regarding program optimization, two major changes happen: 1) the total number of
threads is reduced (l.2), and 2) the body is encapsulated inside a loop
(l.16-l.18). As mentioned, in inter-tiling, we define N threads instead of T. The number


5 Since N is in the range 1 < N < T, the last chunk might have fewer cells.

 1 void Host(int[] a, int T){
 2   par Kernel(tid = 0..N)
 3   /*@ req (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
 4            pre(a[tid+i×N]));
 5       ens (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
 6            post(a[tid+i×N])); @*/
 7   {
 8     int j = 0;
 9     /*@ inv j >= 0 && j <= ceiling(T, N);
10         inv (\forall* int i; 0 <= i && i < ceiling(T, N) && tid+i×N < T;
11              prePerm(a[tid+i×N]));
12         inv (\forall int i; j <= i && i < ceiling(T, N) && tid+i×N < T;
13              preFunc(a[tid+i×N]));
14         inv (\forall* int i; 0 <= i && i < j && tid+i×N < T;
15              postFunc(a[tid+i×N])); @*/
16     loop (tid+j×N < T){
17       body(a[tid+j×N]);
18       j = j + 1; }
19   }
20 }


Fig. 9: Optimized version of the GPU program of Fig. 8 after applying inter-tiling.


of chunks is indicated by the function ceiling(T, N). Each thread in the newly
added loop iterates over all chunks (in the range 0 to ceiling(T, N)-1) to be
responsible for a specific location. This is achieved through the loop variable j
and the loop condition tid+j×N < T. This means, each thread tid can access its own
location at index tid in each chunk. To preserve verifiability, we add invariants
to the loop (l.9-l.17). Therefore, we specify:


— the boundaries of the loop variable j, which iterates over all chunks.

— a permission-related invariant for each thread in each chunk (l.10). This
comes from the precondition of the kernel and is quantified over all chunks.

— an invariant to indicate functional properties of the locations that have not
yet been updated by threads in the body of the loop (l.12). This comes from
the functional precondition of the kernel and is quantified over all chunks.

— an invariant to specify how each thread updates the array in each chunk
(l.14). This comes from the functional property as the postcondition of the
kernel and is quantified over all chunks.


Moreover, we modify the specification of the kernel (l.3-l.6). Note that we have
the condition tid+i×N < T in all universally quantified invariants, because the
last chunk might have fewer cells than N. We quantified the pre- and
postcondition of the kernel over the chunks in the same way as the invariants.

Intra-tiling is in essence similar to inter-tiling with two major differences: 1) 
the total number of threads is ceiling(T, N), and 2) each thread in the loop 
iterates over cells within its own chunk. Therefore, we have different conditions 
in the loop and the quantified invariants. ALPINIST also supports this. 


Above, each thread is assigned to one cell. This can easily be generalized 
to have each thread assigned to one or more consecutive cells (i.e., a task). A 
similar procedure can be applied as long as the tasks do not overlap, i.e., each 
cell is assigned to at most one thread. 
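
The two thread-to-cell mappings can be checked with a small Python sketch. The inter-tiling loop follows the loop condition of Fig. 9. The intra-tiling mapping is an assumption, since the text above only states that each thread iterates over the cells of its own chunk. The function names are hypothetical.

```python
def ceiling(t, n):
    """Number of chunks, i.e. the ceiling of t / n for positive integers."""
    return -(-t // n)


def inter_tiling(t, n):
    """N threads; thread tid visits index tid + j*N in every chunk j."""
    cells = {}
    for tid in range(n):
        visited = []
        j = 0
        while tid + j * n < t:          # loop condition of Fig. 9
            visited.append(tid + j * n)
            j += 1
        cells[tid] = visited
    return cells


def intra_tiling(t, n):
    """ceiling(T, N) threads; thread tid visits all cells of chunk tid.

    ASSUMPTION: chunk tid covers indices tid*N up to min((tid+1)*N, T) - 1.
    """
    return {tid: list(range(tid * n, min((tid + 1) * n, t)))
            for tid in range(ceiling(t, n))}


def covers_each_cell_once(cells, t):
    """Tasks must not overlap: every cell belongs to exactly one thread."""
    visited = sorted(index for task in cells.values() for index in task)
    return visited == list(range(t))


inter = inter_tiling(12, 4)   # the setting of Fig. 7: T = 12, N = 4
intra = intra_tiling(12, 4)
```

The helper covers_each_cell_once expresses the non-overlap requirement mentioned above. It also holds when the last chunk is shorter, for example for T = 10 and N = 4.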
4.3 Kernel Fusion


Kernel fusion is a GPU optimization where we merge two or more consecutive
kernels into one. It increases the potential to use thread-local registers to store
intermediate results (see Section 2) and can lead to lower power consumption.
See [2, 19, 61, 62, 68] for the impact of kernel fusion on GPU programs. We
provide a generalized procedure to fuse an arbitrary number of consecutive kernels
while considering data dependency between them. The idea is to fuse them by
repeatedly fusing the first two kernels (i.e., kernel reduction). In each iteration,
if there is no data dependency between the two kernels, we safely fuse them.
Otherwise, if there is only one thread block, we fuse the two kernels by inserting
a barrier between the bodies; if there are multiple thread blocks, fusion fails.

A benefit of this approach is that it only considers two kernels at a time.
In this way, it can be determined whether a barrier is necessary between two
specific kernels, and we do not miss any possible fusion optimization. Another
benefit of this approach is that when a data dependency between two kernels P
and P+1 (1 ≤ P ≤ #kernels−1) is detected, the output of the approach is the
fusion of the first P kernels, and the remaining unfused kernels after P. This
allows the user to not only find out that there is a data dependency between P
and P+1, but also to obtain fused kernels where possible.

There are multiple challenges in this transformation: (1) how to detect data 
dependency between two kernels? (2) how to collect the pre- and postconditions 
for the fused kernel? and (3) how to deal with permissions so that in the fused 
kernel the permission for a location does not exceed 1? The main difficulty in 
addressing these challenges is that we have to consider many different possible 
scenarios. Fortunately, we can use the information from the contract of the two 
kernels. The permission patterns in the contract indicate for each thread which 
locations it reads from and writes to. We provide procedures to separately collect 
pre- and postconditions related to permissions and to functional correctness. Due 
to space limitations, we only discuss the essential steps to collect the precondition 
related to permissions for array accesses of the fused kernel in Alg. 1. Collecting 
the rest of the contract uses a similar procedure. 

Alg. 1 requires kernels k1 and k2 to not lose any permissions, only possibly 
redistribute them (using a barrier). Furthermore, for ease of presentation, we 
assume that in both k1 and k2, each thread accesses at most one cell of array a, 
and that the expressions used to compute array indices only combine constants 
and thread ID variables, using standard arithmetic operators. 

We compare the postcondition of k1 and the precondition of k2 (l.2) to
understand how to add permissions of the preconditions of k1 and k2 to the
precondition of the fused kernel. Note that prePerm and postPerm correspond
to a permission-related pre- and postcondition, respectively. We use the
postcondition of k1 for this comparison since the permission at the end of k1 needs
to be sufficient to satisfy the precondition of k2. If the index expressions e1 and
e2 to access an array a are syntactically the same, then they refer to the same
array cell. In that case, we first add to the precondition of the fused kernel the
original permission from the precondition of k1 that corresponds to the
permission for a[e1] in the postcondition of k1 (remember that the latter permission
may have been obtained in k1 after permission redistribution). Second, if p1 is
not sufficient for the precondition of k2 (l.5), we add additional permission to
the precondition of the fused kernel to satisfy the precondition of k2 (l.6).

Algorithm 1 Kernel fusion procedure for collecting precondition permissions.

 1: Add all precondition permissions related to non-shared arrays (i.e., accessed by only one of the
    two kernels) into the contract of the fused kernel kf.
 2: for each shared array a with a permission postPerm(a[e1], p1) in the postcondition of the first
    kernel k1 and a permission prePerm(a[e2], p2) in the precondition of the second kernel k2 do
 3:   if patterns e1 and e2 are syntactically the same then
 4:     Add pre. of k1 corresponding to postPerm(a[e1], p1) as pre. to kf
 5:     if p1 < p2 then
 6:       Add prePerm(a[e2], p2-p1) as pre. to kf
 7:   else if patterns e1 and e2 are not syntactically the same then
 8:     if p1 + p2 ≤ 1 then
 9:       Add pre. of k1 corresp. to postPerm(a[e1], p1) and prePerm(a[e2], p2) as pre. in kf
10:     else if p1 + p2 > 1 && p1 < 1 && p2 < 1 then
11:       Add pre. of k1 corresp. to postPerm(a[e1], p1) with permission p3 and prePerm(a[e2],
12:       p4) as pre. s.t. p3 + p4 == 1
13:     else if p1 == 1 (i.e., write) then          ▷ Data dependency, add barrier
14:       Add pre. of k1 corresponding to postPerm(a[e1], p1) as pre. to kf
15:     else p2 == 1                                ▷ Data dependency, add barrier
16:       Add pre. of k1 corresponding to postPerm(a[e1], p1) as pre. to kf
17:       Add prePerm(a[e2], 1-p1) as pre. to kf


The remaining different cases in the algorithm correspond to the different
edge cases that we should consider when e1 and e2 are not syntactically the
same. In particular, data dependency happens when the accumulated permission
(in both kernels) for one location is greater than 1, and there is at least one write
permission. Therefore, we have to distinguish multiple cases: 1) p1 + p2 does not
exceed 1 (l.8), 2) p1 + p2 exceeds 1, but no write permission is involved (l.10),
or 3) and 4) at least one write is involved (l.13 and l.15). In the latter two cases,
a barrier must be introduced to take care of distributing permissions from the
access in k1 to the access in k2, and possibly additional permission for the latter
must be added to the precondition of the fused kernel (l.17). After constructing
the contract of the fused kernel, we check for data dependency.

Fig. 10 shows an example of fusing two kernels. We only present the
permission precondition expressions which are collected with Alg. 1. There are two
shared arrays a and b. To collect permission preconditions in the fused kernel,
we follow steps (l.2→l.3→l.4) for array a and steps (l.2→l.3→l.4→l.5→l.6) for
array b. As there is no data dependency, we can safely fuse the two kernels.
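
The case analysis of Alg. 1 for a single shared array can be written down as a short Python function and replayed on the example of Fig. 10. This is a sketch under several assumptions rather than ALPINIST's implementation: index patterns are compared as plain strings, each contract is reduced to one (pattern, permission) pair, and the function name is hypothetical. For the case on l.10-l.12, Alg. 1 only requires p3 + p4 == 1; the choice p3 = p1 below is an assumption made for illustration.

```python
from fractions import Fraction


def collect_pre_permissions(k1_pre, k1_post, k2_pre):
    """Permission preconditions of the fused kernel for one shared array.

    Each argument is an (index_pattern, permission) pair; k1_pre is the
    precondition of k1 that corresponds to k1_post. Returns the collected
    pairs and a flag telling whether a barrier is needed.
    """
    (e1, p1), (e2, p2) = k1_post, k2_pre
    if e1 == e2:                                   # l.3: same pattern
        collected = [k1_pre]                       # l.4
        if p1 < p2:                                # l.5
            collected.append((e2, p2 - p1))        # l.6
        return collected, False
    if p1 + p2 <= 1:                               # l.8
        return [k1_pre, (e2, p2)], False           # l.9
    if p1 < 1 and p2 < 1:                          # l.10: no write involved
        # l.11-l.12: ASSUMPTION p3 = p1, hence p4 = 1 - p1.
        return [(k1_pre[0], p1), (e2, 1 - p1)], False
    if p1 == 1:                                    # l.13: data dependency
        return [k1_pre], True                      # l.14
    return [k1_pre, (e2, 1 - p1)], True            # l.15-l.17: p2 == 1


one, half = Fraction(1), Fraction(1, 2)
# Array a in Fig. 10: Kernel1 has permission 1, Kernel2 requires 1\2.
pre_a = collect_pre_permissions(("tid", one), ("tid", one), ("tid", half))
# Array b in Fig. 10: Kernel1 has permission 1\2, Kernel2 requires 1.
pre_b = collect_pre_permissions(("tid", half), ("tid", half), ("tid", one))
```

For array a this yields the single permission Perm(a[tid], 1), and for array b the two permissions Perm(b[tid], 1\2), which together match the contract of the fused kernel in Fig. 10.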


Implementing Data Dependency Detection. One of the implementation
challenges of kernel fusion is to check data dependency in the applicability checking
phase. Our idea of detecting kernel dependencies is similar to detecting loop
iteration dependencies, see [1]. To detect data dependency for a specific shared
array, the function SV is used. Fig. 11 shows an example of the output of SV. The
kernel has 1\2 permission for a[tid+1] and 1\3 permission for a[0] if tid+1 is
out of bounds. SV takes an array name and the pre- and postconditions of a
kernel (of the form cond(tid) => Perm(a[patt(tid)], p), as on l.3-l.6), and returns
a mapping from indices patt(tid) to the permissions p (in Fig. 11: right).

Original kernels:

 1 void Host(...){
 2   par Kernel1(tid1 = 0..T)
 3   /*@ context Perm(a[tid1], 1);
 4       context Perm(b[tid1], 1\2); @*/
 5   { a[tid1] = 2*b[tid1]; }
 6   par Kernel2(tid2 = 0..T)
 7   /*@ context Perm(a[tid2], 1\2);
 8       context Perm(b[tid2], 1); @*/
 9   { b[tid2] = a[tid2]+1; } }

Fused kernel:

 1 void Host(...){
 2   par Fused_Kernel(tid = 0..T)
 3   /*@ req Perm(a[tid], 1);
 4       req Perm(b[tid], 1\2);
 5       req Perm(b[tid], 1\2); @*/
 6   { a[tid] = 2*b[tid];
 7     b[tid] = a[tid]+1; } }

Fig. 10: An example of collecting preconditions in fusing two kernels.

 1 void Host(...){
 2   par Kernel1(tid1 = 0..T)
 3   /*@ context (tid != a.length-1 =>
 4                Perm(a[(tid + 1)], 1\2));
 5       context (tid == a.length-1 =>
 6                Perm(a[0], 1\3)); @*/
 7   { ... }

Output SV(a, SPEC_kernel):

index      | 0   | 1   | 2   | 3   | 4
permission | 1\3 | 1\2 | 1\2 | 1\2 | 1\2


Fig. 11: Example output of the SV function for array a.


If the function SV is executed for two kernels to fuse with the same shared
array a, the results SV1(a) and SV2(a) can be compared to determine whether
there is data dependency between the two kernels. This comparison is described
generally at l.8-l.16 in Algorithm 1. For each corresponding location in SV1(a)
and SV2(a), we can determine, for example, whether both permissions combined
do not exceed 1 (l.8) or whether the location in k1 has write permission (l.12).
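
A possible reading of SV and of the comparison of two SV results is sketched below in Python. It is an illustration under assumptions, not the ALPINIST code: the contract is given as a list of (cond, patt, permission) triples with cond and patt as Python functions of tid, the array length is a concrete number, and permissions of several threads for the same index are summed. The comparison is meant for the case in which the index patterns of the two kernels differ (l.8-l.16 of Algorithm 1). All function names are hypothetical.

```python
from fractions import Fraction


def sv(length, clauses):
    """Map each array index to the permission that the contract gives for it.

    clauses is a list of (cond, patt, permission) triples, mirroring
    cond(tid) => Perm(a[patt(tid)], permission).
    """
    table = {}
    for tid in range(length):
        for cond, patt, permission in clauses:
            if cond(tid):
                index = patt(tid)
                # ASSUMPTION: permissions of different threads are summed.
                table[index] = table.get(index, Fraction(0)) + permission
    return table


def has_data_dependency(sv1, sv2):
    """Accumulated permission above 1 with at least one write permission."""
    for index in sv1.keys() & sv2.keys():
        p1, p2 = sv1[index], sv2[index]
        if p1 + p2 > 1 and (p1 == 1 or p2 == 1):
            return True
    return False


LENGTH = 5
# The contract of Fig. 11 for an array of length 5.
fig11 = sv(LENGTH, [
    (lambda tid: tid != LENGTH - 1, lambda tid: tid + 1, Fraction(1, 2)),
    (lambda tid: tid == LENGTH - 1, lambda tid: 0, Fraction(1, 3)),
])
# A hypothetical preceding kernel in which thread tid writes a[tid].
writer = sv(LENGTH, [(lambda tid: True, lambda tid: tid, Fraction(1))])
dependent = has_data_dependency(writer, fig11)
```

For length 5, fig11 maps index 0 to 1\3 and indices 1 to 4 to 1\2, as in the table of Fig. 11. Comparing it with the hypothetical writer kernel reports a data dependency, because the combined permission for every location exceeds 1 and one of the kernels writes.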


4.4 Other Optimizations 


We briefly discuss the three remaining optimizations supported by ALPINIST. 
Iteration merging⁶ is an optimization technique related to loop unrolling that
is applicable to both GPU and CPU programs. Iteration merging reduces the
number of loop iterations by extending the loop body with multiple copies of it,
as opposed to creating copies of it outside the loop, as is done in loop unrolling.
Iteration merging can have a positive performance impact; see [38, 46, 53] for the
effectiveness of this optimization on GPU programs.
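
The difference in the number of loop-condition checks can be made visible with a small Python sketch. It is only an illustration: the function names are hypothetical, and the handling of leftover iterations (a second, unmerged loop) is an assumption, since the text above does not describe how ALPINIST treats iteration counts that are not a multiple of the merge factor.

```python
def plain_loop(n, body):
    """Run body(i) for i in 0..n-1 and count the loop-condition checks."""
    i = checks = 0
    while True:
        checks += 1
        if not i < n:
            break
        body(i)
        i += 1
    return checks


def merged_loop(n, m, body):
    """Same iterations, but m copies of the body per loop iteration."""
    i = checks = 0
    while True:
        checks += 1
        if not i + m <= n:
            break
        for offset in range(m):      # stands for m textual copies of the body
            body(i + offset)
        i += m
    while True:                      # ASSUMPTION: leftover iterations run unmerged
        checks += 1
        if not i < n:
            break
        body(i)
        i += 1
    return checks


plain_trace, merged_trace = [], []
plain_checks = plain_loop(10, plain_trace.append)
merged_checks = merged_loop(10, 3, merged_trace.append)
```

Both versions execute the body for the same values of i in the same order, while the merged version evaluates a loop condition fewer times.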

Matrix linearization is an optimization where we transform two-dimensional
arrays into one-dimensional ones. This optimization can result in better memory
access patterns, thereby improving caching. See [5, 13, 54] for the impact of matrix
linearization on GPU programs.
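
The index translation behind matrix linearization can be sketched in Python as follows. A row-major layout is assumed here for illustration; the text above does not state which layout ALPINIST uses, and the function names are hypothetical.

```python
def linearize(matrix):
    """Flatten a rectangular two-dimensional array (ASSUMPTION: row-major)."""
    cols = len(matrix[0])
    flat = [value for row in matrix for value in row]
    return flat, cols


def at(flat, cols, i, j):
    """Access element (i, j) of the original matrix in the linearized array."""
    return flat[i * cols + j]


matrix = [[1, 2, 3],
          [4, 5, 6]]
flat, cols = linearize(matrix)
```

Every access matrix[i][j] in the original program becomes flat[i * cols + j] in the optimized program, so index expressions in the annotations have to be rewritten in the same way.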

The last optimization implemented in ALPINIST is data prefetching. Suppose 
there is a verified GPU program where each thread accesses an array location 
in global memory multiple times. In this optimization, we prefetch the values 
of those locations that are in global memory into registers which are local to 
each thread. A similar optimization, in which intermediate results are stored in 
register memory, is applied in Section 2. Therefore, instead of multiple accesses
to the high-latency global memory, we benefit from low-latency registers. Data
prefetching can have a positive performance impact; see [4, 58, 70]. 


6 Iteration merging is also referred to as loop unrolling/vectorization in the literature.

Table 1: A summary of the optimization and verification times for all optimizations.

Optimization         | Optim. time (s)         | Verif. time (orig.) (s) | Verif. time (opt.) (s)
                     | min.  max.  avg.  med.  | min.  max.  avg.  med.  | min.  max.  avg.  med.
Loop unrolling       | 0.067 0.238 0.116 0.098 | 7.6   50.7  18.2  14.3  | 7.6   57.5  20.8  17.3
Tiling               | 0.044 0.052 0.048 0.047 | 16.7  21.5  18.7  18.1  | 19.3  31.4  24.7  20.8
Kernel fusion        | 0.099 0.338 0.173 0.137 | 16.7  54.5  24.6  20.0  | 14.9  22.3  19.0  19.5
Iteration merging    | 0.042 0.592 0.152 0.097 | 6.9   51    17.0  12.7  | 7.3   64    20.0  13.8
Matrix linearization | 0.011 0.044 0.022 0.017 | 11.6  16    14.3  14.1  | 11.5  16.8  14.4  15.1
Data prefetching     | 0.010 0.068 0.051 0.053 | 9.7   23    14.0  13.4  | 10.4  23    13.5  12.7


5 Evaluation 


This section describes the evaluation of ALPINIST. The goal is to 


Q1 test whether ALPINIST works on GPU programs. 

Q2 investigate how long it takes for ALPINIST to transform GPU programs and 
how this affects the verification time. 

Q3 investigate the usability of ALPINIST on real-world complex examples. 


5.1 Experiment Setup 


ALPINIST is evaluated on examples from three different sources. The first source 
consists of hand-made examples that cover different scenarios for each
optimization. The second source is a collection of verified programs from VerCors'
example repository⁷. The third source consists of complex case studies that are
already verified in VerCors: two parallel prefix sum algorithms [51], parallel
stream compaction and summed-area table algorithms [48], a variety of
sorting algorithms [49], a solution [27] to the VerifyThis 2019 challenge 1 [18] and a
Tic-Tac-Toe example [57] based on [23]. In total, we applied the optimizations
30 times in the first category, 23 times in the second category and 17 times in 
the third category (in total 70 experiments). All the examples are annotated 
with special optimization annotations such that ALPINIST can apply those op- 
timizations automatically. All these examples are publicly available at [15]. All 
the experiments were conducted on a MacBook Pro 2020 (macOS 11.3.1) with 
a 2.0GHz Intel Core i5 CPU. Each experiment was performed ten times, af- 
ter which the average times, i.e., optimization and verification times, of those 
executions were recorded for the experiment. 


5.2 Results & Discussion 


Q1 To test whether ALPINIST works on GPU programs, we applied the six 
optimizations in all 70 experiments and used VerCors to reverify all the resulting 
programs. All these tests were successful. 

Q2 To investigate how long it takes for ALPINIST to transform GPU programs, 
we recorded the transformation time for each optimization applied to all the 


* The example repository of VerCors is available at
https://github.com/utwente-fmt/vercors/tree/dev/examples.
Table 2: An overview of optimizing case studies, where # is the unroll factor (for
loop unrolling) or the merge factor (for iteration merging), OT the time it takes to
optimize, VB the original verification time (Verification Before) and VA the optimized
verification time (Verification After). All times are in seconds.
Case               | Loop unrolling       | Iter. merging        | Matrix lin.     | Data pref.
                   | #   OT    VB   VA    | #   OT    VB   VA    | OT    VB   VA   | OT    VB   VA
BubbleSort [49]    | 1   0.101 25.4 27.3  | 4   0.170 29.8 34.1  | N/A   N/A  N/A  | N/A   N/A  N/A
InsertionSort [49] |     0.134 25.6 25.8  | 3   0.225 24.1 28.0  | N/A   N/A  N/A  | N/A   N/A  N/A
SelectionSort [49] |     0.107 23.5 25.7  | 2   0.592 22.8 27.7  | N/A   N/A  N/A  | N/A   N/A  N/A
TimSort [49]       |     0.216 29.3 38.5  | 3   0.182 29.1 37.9  | N/A   N/A  N/A  | N/A   N/A  N/A
Blelloch [51]      | 1   0.129 50.7 57.5  |     0.355 51.0 64.0  | N/A   N/A  N/A  | N/A   N/A  N/A
Kogge-Stone [51]   | 1   0.238 23.0 25.6  |     0.082 21.8 25.6  | N/A   N/A  N/A  | 0.103 23.0 23.0
TicTacToe [57]     | 3   0.106 19.8 21.0  | 2   0.076 17.3 19.6  | N/A   N/A  N/A  | N/A   N/A  N/A
VerifyThis [27]    |     0.144 26.2 28.7  | N/A N/A   N/A  N/A   | N/A   N/A  N/A  | N/A   N/A  N/A
Transpose [48]     |     N/A   N/A  N/A   | N/A N/A   N/A  N/A   | 0.022 16.0 16.0 | N/A   N/A  N/A
# values detached from their rows in the extracted table, in extraction order:
3, 2, 1, 1, 2 (between TimSort and Blelloch); 1, N/A (between TicTacToe and VerifyThis).



examples. Table 1 summarizes the optimization times for the six optimizations
(as reported by ALPINIST). To investigate the impact on the verification time,
the table also shows the verification times of the original and optimized
programs (as reported by VerCors). For each optimization, the table shows the
minimum, maximum, average and median times over all examples. It can be
observed that ALPINIST takes negligible time to apply each optimization: the
worst case over all examples is 0.592 seconds. Moreover, the verification time
after optimizing generally increases. For loop unrolling, tiling and iteration
merging, the verification time increases. This can be attributed to the
additional code that is generated. For kernel fusion, the verification time
decreases. This is due to verifying fewer kernels. For matrix linearization and
data prefetching, the verification time slightly increases. This can be
attributed to the linear expressions in matrix linearization and the extra
statements to read from/write to the registers in data prefetching.
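
The direction and size of these changes can be read off Table 1 as a relative change. The following is a minimal sketch, assuming the average verification times listed in Table 1 for loop unrolling (18.2 s before, 20.8 s after) and kernel fusion (24.6 s before, 19.0 s after); the class and method names are illustrative and not part of ALPINIST or VerCors.

```java
// Illustrative helper; not part of ALPINIST or VerCors.
final class VerificationOverhead {

    // Relative change in percent from the time before to the time after optimizing.
    static double relativeChangePercent(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Average verification times in seconds, taken from Table 1.
        System.out.printf("loop unrolling: %+.1f%%%n", relativeChangePercent(18.2, 20.8));
        System.out.printf("kernel fusion:  %+.1f%%%n", relativeChangePercent(24.6, 19.0));
    }
}
```

A positive value means that verifying the optimized program takes longer than verifying the original one; a negative value means it takes less time.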

Q3 To investigate the usability of ALPINIST on real-world examples, we
successfully applied it to the third category, the complex case studies. Table 2
shows the optimization and verification times of applying loop unrolling,
iteration merging, matrix linearization and data prefetching to these case
studies. Note that only these four optimizations could be applied in the case
studies. In the table, N/A indicates that the optimization is not applicable to
the example.


6 Related Work 


To the best of our knowledge, this is the first paper to showcase a tool that 
implements annotation-aware transformations. We categorize the related work 
into three parts, covering both tools and optimizations. 


Automatic Optimizations without Correctness. There is a large body of related 
work, see e.g., [2, 4, 19, 25, 28, 47, 61, 62, 68-70], that shows the impact of auto- 
mated optimizations on GPU programs, but does not consider correctness, or 
the preservation of it. Our tool can potentially complement these approaches by 
preserving the provability of the optimized programs. 
Correctness Proofs for Transformations. Another body of related work focuses
on different approaches to preserve provability that are not specific to GPU
programs. CompCert [30,31] is a formally verified C compiler which preserves
semantic equivalence of the source and compiled program, by proving correctness
of each transformation in the compilation process. Wijs and Engelen [66] and
De Putter and Wijs [45] prove the preservation of functional properties over
transformations on models of concurrent systems. They prove preservation of
model-independent properties. This approach differs from ours as they work on
models instead of concrete programs.


Compiler Optimization Correctness. Finally, there is related work that focuses
on the compilation of sequential programs, performing transformations from
high-level source code to lower-level machine code while preserving the
semantics. These approaches neither consider parallelization, nor target
different architectures. In GPU programming, the optimizations often need to be
applied manually rather than during the compilation process.

Namjoshi and Xu [41] use a proof checker to show equivalence between an
original WebAssembly program and its optimized version. An equivalence proof is
generated based on the transformations. Namjoshi and Singhania [40] created a
semi-automatic loop optimizer with user directives. The loops are verified
during compilation. For each transformation, semantics are defined to guarantee
semantic equivalence to the original program. Namjoshi and Pavlinovic [39] focus
on recovering from precision loss due to semantics-preserving program
transformations and propose systematic approaches to simplify analysis of the
transformed program. Finally, Gjomemo et al. [20] help compiler optimizations by
supplying high-level information gathered by external static analysis (e.g.,
Frama-C). This information is used by the compiler for better reasoning.


7 Conclusion 


In this paper, we presented ALPINIST, the annotation-aware GPU program
optimizer. Given an unoptimized, annotated GPU program, we showed how ALPINIST
transforms both the code and the annotations, with the goal to preserve the
provability of the optimized GPU program. ALPINIST supports loop unrolling,
tiling, kernel fusion, iteration merging, matrix linearization and data
prefetching, of which the first three are discussed in detail. We discussed the
design and implementation of ALPINIST, and we validated it by verifying a set of
examples and reverifying their optimized counterparts.

For future work, there are other optimizations that could be supported, such 
as data prefetching for all memory patterns as mentioned by Ayers et al. [4]. 
Another open question is if and how this approach can be used in program 
compilation. We also plan to extend this approach to preserve the provability 
of transpiled code, e.g., CUDA to OpenCL conversions. Moreover, we plan to 
investigate how ALPINIST can be combined with techniques such as autotuning 
that automatically detect the potential for applying specific optimizations and 
identify optimal parameter configurations [3,63]. 
Abstract. Debugging imperative network programs is a difficult task
for operators as it requires understanding various network modules and
complicated data structures. For this purpose, this paper presents an
automated technique for repairing network programs with respect to unit
tests. Given as input a faulty network program and a set of unit tests,
our approach localizes the fault through symbolic reasoning, and
synthesizes a patch ensuring that the repaired program passes all unit
tests. It applies domain-specific abstraction to simplify network data
structures and exploits function summary reuse for modular symbolic
analysis. We have implemented the proposed techniques in a tool called
NETREP and evaluated it on 10 benchmarks adapted from real-world
software-defined network controllers. The evaluation results demonstrate
the effectiveness and efficiency of NETREP for repairing network programs.


1 Introduction 


Emerging tools for program synthesis and repair facilitate automation of pro- 
gramming tasks in various domains. For example, in the domain of end-user 
programming, synthesis techniques allow users without any programming expe- 
rience to generate scripts from examples for extracting, wrangling, and manip- 
ulating data in spreadsheets [13,40]. In computer-aided education, repair tech- 
niques are capable of providing feedback on programming assignments to novice 
programmers and help them improve programming skills [49,14]. In software 
development, synthesis and repair techniques aim to reduce the manual efforts 
in various tasks, including code completion [43,10], application refactoring [42], 
program parallelization [8], bug detection [11,41], and patch generation [11,32]. 

As an emerging domain, Software-Defined Networking (SDN) offers the in- 
frastructure for monitoring network status and managing network resources 
based on programmable software, replacing traditional specialized hardware in 
communication devices. Since SDN provides an opportunity to dynamically mod- 
ify the traffic handling policies on programmable routers, this technology has 
witnessed growing industrial adoption. However, using SDNs involves many pro- 
gramming tasks that are inevitably susceptible to programmer errors leading to 
bugs [3,23]. For example, a device with incorrect routing policies could forward a 
packet to undesired destinations, and a buggy firewall rule may make the entire 
network system vulnerable to security threats. 
In the SDN framework, a logically centralized control plane generates rules
that are installed into data planes, which in turn decide the routing of packets
throughout the network. While network verification is a well-studied field where
operators can be given hints about incorrectly installed rules [3,4,22], little
prior work has explored the problem of automatically repairing the corresponding
bug in the control plane, especially for control planes written in widely used
general-purpose languages such as Java or Python. Existing work mostly restricts
the target to control plane programs written in domain-specific languages such
as Datalog [51,17].

Since networks cannot tolerate even small mistakes, and most network operators
are not trained in programming skills, debugging and repair tools in this
domain should prioritize accuracy and automation. This means that many existing
techniques for general program repair are not suitable for this domain, as they
trade accuracy for heuristics that scale with the size of the analyzed programs
and the number of discovered potential bugs.

Motivated by the demand for automated repair and the limitations of ex- 
isting techniques, we develop a precise and scalable program repair technique 
for network programs. Specifically, our repair technique takes as input a net- 
work program and a set of unit tests, reveals the program location that causes 
the test failure, and automatically generates a patch to fix the program. In the 
setting of SDN, a unit test corresponds to an incorrectly installed routing rule 
generated by the control plane from a reported packet. Such unit tests can be 
discovered by a separate network verification procedure [3,4,22]. 

Our main idea is to use symbolic reasoning using constraints capturing the 
semantics of the program for accurate repair, and modular analysis to improve 
the efficiency. We extended the encoding techniques from prior work [21,12] to 
support object-oriented features in Java. We also developed a new approach to 
focus the analysis on one function at a time and gradually narrow down the range 
of faulty statements along with the specification for the expected behavior. 

The proposed technique is implemented in an automatic network program 
repair tool called NETREP. To evaluate NETREP, we adapt 10 benchmarks from 
real-world faulty network programs in Floodlight that require changing up to 3 
lines of code to fix and apply NETREP to repair the benchmarks automatically. 
The experimental results show that NETREP is able to find a repair that passes 
all unit tests for faulty programs up to 738 lines of code for 8 benchmarks using 
2 or 3 test cases, outperforming a state-of-the-art repair tool for general Java 
programs. Furthermore, NETREP is efficient in terms of repair time, requiring 
only an average running time of 744 seconds across all benchmarks. 


Contributions. We make the following main contributions in this paper: 


— We present an automated program repair technique that aims to help net- 
work operators debug and fix network controller programs automatically. 

— We describe a bug localization approach based on symbolic execution and 
constraint solving for programs with imperative object-oriented features such 
as virtual function calls. 

— We propose novel modular analysis techniques to effectively scale up the 
symbolic reasoning for automatic repair. 
 1 @network public class MacAddr {
 2   private long value;
 3   private MacAddr(long v) { value = v; }
 4   public static MacAddr NONE = new MacAddr(0);
 5   public static MacAddr of(long v) { return new MacAddr(v); }
 6 }
 7 public class FirewallRule {
 8   public MacAddr dl_dst; public boolean any_dl_dst;
 9   public FirewallRule() {
10     dl_dst = MacAddr.NONE; any_dl_dst = true; ... }
11   public boolean isSameAs(FirewallRule r) {
12     if (... || any_dl_dst != r.any_dl_dst
13         || (any_dl_dst == false &&
14             dl_dst != r.dl_dst)) {
15       return false; }
16     return true; }
17 }

Fig. 1: Code snippet about a bug in Floodlight.


1 public boolean test(long mac1, long mac2) {
2   FirewallRule r1 = new FirewallRule();
3   r1.dl_dst = MacAddr.of(mac1); r1.any_dl_dst = false;
4   FirewallRule r2 = new FirewallRule();
5   r2.dl_dst = MacAddr.of(mac2); r2.any_dl_dst = false;
6   return r1.isSameAs(r2); }

Fig. 2: Unit test that reveals the bug in FirewallRule.


— We develop a tool called NETREP based on the proposed techniques and 
evaluate it using 10 benchmarks adapted from real-world network programs. 
The evaluation results demonstrate that NETREP is effective for bug local- 
ization and able to generate correct patches for realistic network programs. 


2 Overview 

In this section, we give a high-level overview of our repair techniques and 
walk through the NETREP tool using an example adapted from the Floodlight 
SDN controller [9]. 

Figure 1 shows a simplified code snippet about firewall rules in Floodlight.
Specifically, the program consists of two classes: FirewallRule and MacAddr.
The FirewallRule class describes rules enforced by the firewall, including
information about source and destination mac addresses. The MacAddr class is an
auxiliary data structure that stores the raw value of mac addresses³.

The network program shown in Figure 1 is problematic because the isSameAs
function compares two mac addresses using the != operator rather than a
negation of the equals function. The != operator only compares two objects based
on their memory addresses, whereas the intent of the developer is to check if two
mac addresses have the same raw value. The bug is revealed by the unit test
in Figure 2, and was then confirmed and fixed by the Floodlight developers⁴. Next, let


3 A unique 48-bit number that identifies each network device.
4 https://github.com/floodlight/floodlight/commit/4d528e4bf5f02c59347bb9c0beb1b875ba2c821e


356 L. Shi et al. 


us illustrate how NETREP localizes this bug based on the unit tests
test(1, 2) = false and test(1, 1) = true and automatically synthesizes a patch
to fix it.
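
The difference between the two comparisons can be reproduced in a few lines of Java. The sketch below is only an illustration under stated assumptions, not Floodlight code: the class MacAddrDemo, the value-based equals and hashCode of the nested MacAddr class, and the helpers buggySame and fixedSame are hypothetical and merely mirror the shape of Figures 1 and 2.

```java
// Hypothetical, self-contained sketch; it mirrors Fig. 1 but is not Floodlight code.
final class MacAddrDemo {

    // Minimal stand-in for MacAddr with value-based equality (an assumption here).
    static final class MacAddr {
        private final long value;
        private MacAddr(long v) { value = v; }
        static MacAddr of(long v) { return new MacAddr(v); }

        @Override public boolean equals(Object o) {
            return o instanceof MacAddr && ((MacAddr) o).value == value;
        }
        @Override public int hashCode() { return Long.hashCode(value); }
    }

    // Buggy comparison: != compares object references, not raw values.
    static boolean buggySame(MacAddr a, MacAddr b) {
        if (a != b) { return false; }
        return true;
    }

    // Patched comparison: the negated equals call compares raw values.
    static boolean fixedSame(MacAddr a, MacAddr b) {
        if (!a.equals(b)) { return false; }
        return true;
    }

    public static void main(String[] args) {
        // Two distinct objects holding the same raw value, as in test(1, 1).
        MacAddr m1 = MacAddr.of(1), m2 = MacAddr.of(1);
        System.out.println("buggy: " + buggySame(m1, m2));
        System.out.println("fixed: " + fixedSame(m1, m2));
    }
}
```

Because MacAddr.of allocates a fresh object on every call, the reference comparison rejects two addresses with the same raw value, which is the behavior that the unit test test(1, 1) = true exposes.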


At a high level, NETREP enters a loop that iteratively attempts to find the 
fault location and synthesize the patch. Since our repair technique works in a 
modular fashion, NETREP first selects a function F in the program and tries 
to repair each possible fault location at a time. If NETREP cannot synthesize a 
patch consistent with the provided unit tests for any potential fault location in 
F, it backtracks and selects the next function and repeats the same process until 
all possible functions are checked. We now describe the experience of running 
NETREP on our illustrative example. 


Iteration 1. NETREP selects the constructor of FirewallRule as the target
function. Fault localization determines that the fault is located at the
dl_dst = MacAddr.NONE part of Line 10, because it is related to the equality
checking in the unit test. However, it is not the fault location. NETREP tries
to synthesize a patch that passes all unit tests to replace this statement, but
fails.


Iteration 2. NETREP selects the same function, the constructor of FirewallRule,
but the fault localization switches to a different statement, any_dl_dst = true
at Line 10. Similar to Iteration 1, the synthesizer cannot generate a correct
patch by replacing this statement.


Iteration 3. Since none of the statements in the constructor is the fault
location, NETREP now selects a different function: isSameAs. The fault
localization determines that any_dl_dst == false at Line 13 may be the fault
location as it may affect the testing results. However, having tried to replace
the statement with many other candidate statements, e.g., r.any_dl_dst == false
and any_dl_dst == true, the synthesizer still fails to generate the correct
patch.


Last iteration. Finally, after several attempts to localize the fault, NETREP
identifies that the fault lies in dl_dst != r.dl_dst at Line 14, which is indeed
the reported bug location. At this time, the synthesizer manages to generate a
correct patch, !dl_dst.equals(r.dl_dst). Replacing the original condition at
Line 14 with this patch results in a program that can pass all the provided test
cases, so NETREP has successfully repaired the original faulty program.


3 Preliminaries 


In this section, we present the language of network programs and describe a
program formalism that is used in the rest of the paper. We also define the
program repair problem that we want to solve.


3.1 Language of Network Programs 


The language of network programs considered in this paper is summarized in 
Figure 3. A network program consists of a set of classes, where each class has an 
optional annotation @network to denote that the class can benefit from network 
domain-specific abstraction. 
Prog   P ::= C+
Class  C ::= @network? class C {a+ F+}
Func   F ::= function f(x1, ..., xn) (L : s)+
Stmt   s ::= l := e | jmp (e) L | ret v | x := new C
           | x := C.f(v1, ..., vn) | x := y.f(v1, ..., vn)
Expr   e ::= l | c | op(e1, ..., en)
LValue l ::= x | x.a | x[v]
Imm    v ::= x | c

x, y ∈ Variable    c ∈ Constant    L ∈ LineID
C ∈ ClassName      f ∈ FuncName    a ∈ FieldName


Fig. 3: Syntax of network programs.


Each class in the program consists of a list of fields and functions. Each
function has a name, a parameter list, and a function body. The function body is
a list of statements, where each statement is labeled with its line number.
Various kinds of statements are included in our language of network programs.
Specifically, assign statement l := e assigns expression e to left value l.
Conditional jump statement jmp (e) L first evaluates predicate e. If the result
is true, then the control flow jumps to line L; otherwise, it performs no
operation. Note that our language does not have traditional if statements or
loop statements, but those statements can be expressed using conditional
jumps.⁵ Return statement ret v exits the current function with return value v.
New statement x := new C creates an object of class C and assigns the object
address to variable x. Static call x := C.f(v1, ..., vn) invokes the static
function f in class C with arguments v1, ..., vn and assigns the return value to
variable x. Similarly, virtual call x := y.f(v1, ..., vn) invokes the virtual
function f on receiver object y with arguments v1, ..., vn and assigns the
return value to variable x. Different kinds of expressions are supported,
including constants, variable accesses, field accesses, array accesses,
arithmetic operations, and logical operations. Since the semantics of network
programs is similar to that of traditional programs written in object-oriented
languages, we omit the formal description of semantics.

In addition, we assume each statement in the program is labeled with a 
globally unique line number, and line numbers are consecutive within a function. 
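
To illustrate how a conditional jump can stand in for an if statement, the following minimal sketch interprets a five-line function body that computes the maximum of two variables. Everything in it is an assumption made for illustration: the class JumpDemo, the encoding of statements as Java objects, the use of a map as the program state, and the 0-based line labels are hypothetical and are not taken from NETREP.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical mini-interpreter; names and encoding are illustrative assumptions.
final class JumpDemo {
    interface Stmt {}

    // l := e, modelled as an update of the variable store.
    static final class Assign implements Stmt {
        final Consumer<Map<String, Integer>> update;
        Assign(Consumer<Map<String, Integer>> update) { this.update = update; }
    }

    // jmp (e) L: jump to line L if predicate e holds, otherwise do nothing.
    static final class Jmp implements Stmt {
        final Predicate<Map<String, Integer>> cond;
        final int target;
        Jmp(Predicate<Map<String, Integer>> cond, int target) {
            this.cond = cond;
            this.target = target;
        }
    }

    // ret v: leave the function with the value of variable v.
    static final class Ret implements Stmt {
        final String var;
        Ret(String var) { this.var = var; }
    }

    // Executes the body from line 0 until a ret statement is reached.
    static int run(Stmt[] body, Map<String, Integer> state) {
        int pc = 0;
        while (true) {
            Stmt s = body[pc];
            if (s instanceof Ret) { return state.get(((Ret) s).var); }
            if (s instanceof Jmp && ((Jmp) s).cond.test(state)) {
                pc = ((Jmp) s).target;
                continue;
            }
            if (s instanceof Assign) { ((Assign) s).update.accept(state); }
            pc++;
        }
    }

    // "if (a < b) m = b; else m = a; return m;" lowered to conditional jumps.
    static int max(int a, int b) {
        Stmt[] body = {
            new Jmp(st -> st.get("a") < st.get("b"), 3),  // 0: jmp (a < b) 3
            new Assign(st -> st.put("m", st.get("a"))),   // 1: m := a
            new Jmp(st -> true, 4),                        // 2: jmp (true) 4
            new Assign(st -> st.put("m", st.get("b"))),   // 3: m := b
            new Ret("m")                                   // 4: ret m
        };
        Map<String, Integer> state = new HashMap<>();
        state.put("a", a);
        state.put("b", b);
        return run(body, state);
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7));
        System.out.println(max(9, 2));
    }
}
```

A loop can be lowered in the same way by letting a jump target an earlier line; the interpreter above does not bound such loops, so this sketch is only meant for loop-free or terminating bodies.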


3.2 Problem Statement 


We assume a unit test t is written in the form of a pair (I, O), where I is the
input and O is the expected output. Given a network program P and a unit
test t = (I, O), we say P passes the test t if executing P on input I yields the
expected output O, denoted by $[\![P]\!]_I = O$. Otherwise, if
$[\![P]\!]_I \neq O$, we say P fails the test t. In general, given a network
program P and a set of unit tests E, program P is faulty modulo E if there
exists a test $t \in E$ such that P fails on t.
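
These definitions translate directly into executable checks. The sketch below is a minimal illustration, assuming that a program can be modelled as a Java Function from inputs to outputs; the names UnitTests, TestCase, passes and isFaulty are hypothetical and not part of NETREP.

```java
import java.util.List;
import java.util.function.Function;

// Illustrative encoding of the definitions; all names are assumptions.
final class UnitTests {

    // A unit test t = (I, O): an input and its expected output.
    static final class TestCase<I, O> {
        final I input;
        final O expected;
        TestCase(I input, O expected) { this.input = input; this.expected = expected; }
    }

    // P passes t if running P on input I yields exactly the expected output O.
    static <I, O> boolean passes(Function<I, O> program, TestCase<I, O> t) {
        return t.expected.equals(program.apply(t.input));
    }

    // P is faulty modulo E if at least one test in E fails.
    static <I, O> boolean isFaulty(Function<I, O> program, List<TestCase<I, O>> tests) {
        return tests.stream().anyMatch(t -> !passes(program, t));
    }

    public static void main(String[] args) {
        List<TestCase<Integer, Integer>> tests = List.of(
            new TestCase<Integer, Integer>(2, 4),
            new TestCase<Integer, Integer>(3, 9));
        Function<Integer, Integer> square = x -> x * x;
        Function<Integer, Integer> twice = x -> 2 * x;
        System.out.println("square faulty: " + isFaulty(square, tests));
        System.out.println("twice faulty:  " + isFaulty(twice, tests));
    }
}
```

In this toy setting, the doubling function agrees with the first test but not with the second, so it is faulty modulo the two tests even though it passes one of them.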

Now let us turn the attention to the meaning of fault locations and patches. 


Definition 1 (Fault location and patch). Let P be a program that is faulty
modulo tests E. Line L is called the fault location of P, if there exists a statement


5 Our repair techniques only handle bounded loops. If there are unbounded loops in 
the network program, we need to perform loop unrolling. 
Algorithm 1 Modular Program Repair

 1: procedure REPAIR(P, E)
    Input: Program P, examples E
    Output: Repaired program P′ or ⊥ to indicate failure
 2:   P ← Abstraction(P);
 3:   V ← {L ↦ false | L ∈ Lines(P)}; P′ ← ⊥;
 4:   while P′ = ⊥ do
 5:     F ← SelectFunction(P, V);
 6:     if F = ⊥ then return ⊥;
 7:     V, P′ ← REPAIRFUNCTION(P, F, E, V);
 8:   return P′;

 9: procedure REPAIRFUNCTION(P, F, E, V)
    Input: Program P, function F, examples E, visited map V
    Output: Updated visited map V, repaired program P′
10:   P′ ← ⊥;
11:   while P′ = ⊥ do
12:     L ← LocalizeFault(P, F, E, V);
13:     if L ≠ ⊥ then
14:       V ← V[L ↦ true];
15:     else
16:       V ← V[L′ ↦ true | TransInFunc(L′, P, F)];
17:     if L = ⊥ or IsCallStmt(P, L) then return V, ⊥;
18:     P′ ← SynthesizePatch(P, E, F, L);
19:   return V, P′;


s such that replacing line L of P with s yields a new program that can pass all 
tests in E. Here, the statement s is called a patch to P. 


Problem statement. Given a network program P that is faulty modulo tests E,
our goal is to find a fault location L in P and generate the corresponding
patch s, such that the patched program $P'$ passes every unit test $t \in E$.


4 Modular Program Repair 


In this section, we present our algorithm for automatically repairing network 
programs from a set of unit tests. 


4.1 Algorithm Overview 


The top-level repair algorithm is described in Algorithm 1. The REPAIR
procedure takes as input a faulty network program P and unit tests E and
produces as output a repaired program $P'$ or $\bot$ to indicate repair failure.

At a high level, the REPAIR procedure maintains a visited map V from line
numbers to boolean values, representing whether each line of P is checked or not.
The REPAIR procedure first applies the domain-specific abstraction to program
P (Line 2) and initializes the visited map V by setting every line in P as not
checked (Line 3). Next, it tries to iteratively repair P in a modular way until it
finds a program $P'$ that is not faulty modulo tests E (Lines 4-8). In particular,
the REPAIR procedure invokes SelectFunction to choose a function F as the target
of repair (Line 5). If none of the functions in P can be repaired, it returns
$\bot$ to indicate that the repair procedure failed (Line 6). Otherwise, it
invokes the REPAIRFUNCTION procedure (Line 7) to enter the localization-synthesis
loop inside the target function F.

In addition to the program P and tests E, the REPAIRFUNCTION procedure
takes as input a target function F and the current visited map V. It produces as
output the updated version of the visited map V, as well as a repaired program
$P'$ or $\bot$ to indicate that the function F cannot be repaired. As shown in
Lines 11-18 of Algorithm 1, REPAIRFUNCTION alternately invokes the sub-procedures
LocalizeFault and SynthesizePatch to repair the target function. In particular,
the goal of LocalizeFault is to identify a fault location in function F. If
LocalizeFault manages to find a fault location L in F, then line L is marked as
visited (Line 14). Otherwise, if LocalizeFault returns $\bot$, it means function
F and all functions transitively invoked in F are correct or not repairable. In
this case, all lines in F and its transitive callees are marked as checked
(Line 16). Furthermore, if the identified fault location L corresponds to a
statement that invokes $F'$, it means the fault location is inside $F'$. Thus,
REPAIRFUNCTION directly returns $\bot$ (Line 17) and SelectFunction will choose
$F'$ as the target function in the next iteration. On the other hand, the goal
of the sub-procedure SynthesizePatch is to generate a patch for function F given
the fault location L. If SynthesizePatch successfully synthesizes a patch and
produces a non-faulty program $P'$, then the entire procedure succeeds with
repaired program $P'$. Otherwise, REPAIRFUNCTION backtracks with a new program
location and repeats the same process.
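
The interplay of the two procedures can be sketched as a toy Java program. This is a simplified illustration of the control flow of Algorithm 1 under strong assumptions, not the NETREP implementation: localization is stubbed as "the next unvisited line of the function", synthesis is stubbed as an oracle that succeeds at exactly one designated line, and the abstraction step as well as the handling of call statements and transitive callees are omitted. All class, field and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy rendering of the REPAIR / REPAIRFUNCTION control flow with stubbed sub-procedures.
final class ModularRepairSketch {
    final Map<String, List<Integer>> functions = new LinkedHashMap<>(); // function -> line IDs
    final Set<Integer> visited = new HashSet<>();                       // visited map V
    final List<Integer> attempts = new ArrayList<>();                   // lines tried, in order
    final int repairableLine;                                           // stub oracle for synthesis

    ModularRepairSketch(int repairableLine) { this.repairableLine = repairableLine; }

    // Stub for SelectFunction: first function with an unvisited line, or null (for ⊥).
    String selectFunction() {
        for (Map.Entry<String, List<Integer>> f : functions.entrySet()) {
            if (!visited.containsAll(f.getValue())) { return f.getKey(); }
        }
        return null;
    }

    // Stub for LocalizeFault: next unvisited line of f, or null (for ⊥).
    Integer localizeFault(String f) {
        for (int line : functions.get(f)) {
            if (!visited.contains(line)) { return line; }
        }
        return null;
    }

    // Stub for SynthesizePatch: succeeds only at the designated line.
    boolean synthesizePatch(int line) {
        attempts.add(line);
        return line == repairableLine;
    }

    // REPAIRFUNCTION: alternate localization and synthesis inside one function.
    Integer repairFunction(String f) {
        while (true) {
            Integer line = localizeFault(f);
            if (line == null) {
                visited.addAll(functions.get(f)); // nothing left to try in f
                return null;
            }
            visited.add(line);
            if (synthesizePatch(line)) { return line; }
        }
    }

    // REPAIR: returns the patched line, or null if every function was exhausted.
    Integer repair() {
        while (true) {
            String f = selectFunction();
            if (f == null) { return null; }
            Integer patched = repairFunction(f);
            if (patched != null) { return patched; }
        }
    }

    public static void main(String[] args) {
        ModularRepairSketch r = new ModularRepairSketch(14);
        r.functions.put("FirewallRule constructor", List.of(9, 10));
        r.functions.put("isSameAs", List.of(12, 13, 14));
        System.out.println("patched line: " + r.repair());
        System.out.println("attempts: " + r.attempts);
    }
}
```

The visited set is what guarantees termination in this sketch: every iteration either marks a new line as visited or exhausts a function, so the loop cannot revisit a location that has already been rejected.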

In the rest of this section, we explain fault localization, modular analysis, 
and patch synthesis in more detail. 


4.2 Fault Localization 


Next, we give a high-level description of our fault localization technique that 
aims to find the fault location in a given program. This corresponds to the 
LocalizeFault procedure in Algorithm 1. We will first show how to encode the 
problem on an entire program, and then explain how the analysis can be made 
modular to boost the performance. 

At a high level, our fault localization technique uses a symbolic approach
by reducing the fault localization problem to a constraint solving problem. In
particular, we introduce a boolean variable for each line L, denoted by B[L], and
encode the fault localization problem as an SMT formula, such that the value
of the variable B[L] indicates whether line L is correct or not.


Checking faulty programs. To understand how to encode the fault localization
problem, let us first explain how to encode the consistency check given a
program P and a test case t = (I, O). Specifically, the encoded SMT formula
$\Phi(t)$ consists of three components:


1. Semantic constraints. For each line $L_i : s_i$, we generate a formula
   $\Phi_i(S, S')$ to describe the semantics of the statement $s_i$.
   Specifically, given a state S that holds before statement $s_i$,
   $\Phi_i(S, S')$ is valid if $S'$ is the state after executing $s_i$. There
   are two parts of the constraint: the memory contents that are changed, and
   the memory contents that are preserved. For example, in case of an
   assignment statement, the constraint will claim that 1) the evaluation
   result of the right value in state S equals the left value in state $S'$,
   and 2) all values except for the left value are the same in S and $S'$.

2. Control flow integrity constraints. In order to ensure all traces satisfying
   the constraint faithfully follow the control flow structure of a given
   program P, we generate another set of formulae. Specifically, we require
   that any line of code that is executed must have exactly one predecessor and
   one successor that are executed, and the branch condition in the code must
   be respected when picking the successor. This guarantees that there is
   exactly one valid execution trace corresponding to one test case.

3. Consistency between program and test. For the provided test case t = (I, O),
   we also generate formulas $\Phi_{in}(S_0, I)$ and $\Phi_{out}(S_n, O)$ to
   ensure the program behavior is consistent with the test. In particular,
   $\Phi_{in}(S_0, I)$ binds input I to the initial state $S_0$ and
   $\Phi_{out}(S_n, O)$ describes the connection between output O and final
   state $S_n$.


The satisfiability of formula $\Phi(t)$ indicates the result of the consistency
check. If $\Phi(t)$ is satisfiable, the solver generates a feasible execution
trace and an assignment of all intermediate states along this trace. In this
case, program P can pass the test t because there exists a valid trace following
the control flow and every pair of adjacent states in the trace is consistent
with the semantics of the corresponding statement. Otherwise, if $\Phi(t)$ is
unsatisfiable, P fails the test t.

Now, to check P against a set of unit tests E, we can conjoin the formula
$\Phi(t_i)$ for each unit test $t_i \in E$ and obtain the conjunction
$\Phi = \bigwedge_{t_i \in E} \Phi(t_i)$. The satisfiability of formula $\Phi$
indicates whether P is faulty modulo tests E⁶.


Methodology of fault localization. Let P be a faulty program modulo E. We know
the corresponding formula $\Phi$ for the consistency check is unsatisfiable.
Suppose the fault location is line $L_i$. One key insight is that replacing the
semantic constraint $\Phi_i(S, S')$ with true yields a satisfiable formula. This
is because true does not enforce any constraint between the pre-state S and
post-state $S'$, so a previously invalid trace caused by the bug at $L_i$
becomes valid now.

Based on this insight, we develop a methodology to find the fault location
using symbolic reasoning. Specifically, given a consistency check formula
$\Phi$, we can obtain a fault localization formula $\Phi'$ by replacing the
semantic constraint $\Phi_i(S, S')$ with $B[L_i] \rightarrow \Phi_i(S, S')$ for
every line $L_i$, $i \in [1, n]$. Here, variable $B[L_i]$ decides whether or not
it turns the semantic constraint of $L_i$ into true. Thus, $B[L_i] = false$
indicates $L_i$ is a fault location.


6 The encoding is described in more detail in the extended version [46].
One hiccup here is that formula $\Phi'$ is always satisfiable and a model of
$\Phi'$ can simply assign $B[L_i] = false$ for all $L_i$. It means all lines in
the program are fault locations, which is not useful for fault localization. To
address this issue, we can add a cardinality constraint stating there are
exactly K variables in map B that can be assigned to false, which forces the
constraint solver to find exactly K fault locations in program P.
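
The relaxation idea can be tried out on a toy example without an SMT solver. The sketch below is an illustration under explicit assumptions and not the encoding used by NETREP: the program is straight-line, its state is a single integer, each line is a deterministic update, the cardinality bound is fixed to K = 1, and the solver is replaced by brute-force enumeration of the relaxed line's post-state over a small integer range. The class RelaxationLocalizer and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

// Brute-force stand-in for solving the fault localization formula with K = 1.
final class RelaxationLocalizer {

    // Returns the 1-based lines whose relaxation makes every test satisfiable.
    static List<Integer> candidates(List<IntUnaryOperator> lines, int[][] tests, int lo, int hi) {
        List<Integer> result = new ArrayList<>();
        for (int relaxed = 0; relaxed < lines.size(); relaxed++) {
            boolean allTestsSat = true;
            for (int[] t : tests) {
                // Each test may pick its own post-state for the relaxed line.
                allTestsSat &= satisfiable(lines, relaxed, t[0], t[1], lo, hi);
            }
            if (allTestsSat) { result.add(relaxed + 1); }
        }
        return result;
    }

    // Is there a post-state v in [lo, hi] for the relaxed line such that
    // running the program on the input ends in the expected output?
    static boolean satisfiable(List<IntUnaryOperator> lines, int relaxed,
                               int input, int expected, int lo, int hi) {
        for (int v = lo; v <= hi; v++) {
            int state = input;
            for (int i = 0; i < lines.size(); i++) {
                // The relaxed line imposes no constraint: its post-state is v.
                state = (i == relaxed) ? v : lines.get(i).applyAsInt(state);
            }
            if (state == expected) { return true; }
        }
        return false;
    }

    public static void main(String[] args) {
        // Intended behavior: (x + 1) * 2 - 3. Line 2 wrongly multiplies by 3.
        List<IntUnaryOperator> lines =
            List.<IntUnaryOperator>of(x -> x + 1, x -> x * 3, x -> x - 3);
        int[][] tests = { {1, 1}, {2, 3} }; // pairs (input, expected output)
        System.out.println(candidates(lines, tests, -50, 50));
    }
}
```

In this toy example the relaxation reports both the buggy second line and the last line as candidates, since relaxing the last line can produce any output. Localization alone therefore narrows the search but does not pin down the fault; a candidate is only confirmed once a patch for it passes all tests.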


Modular analysis. The method above can precisely compute a potential fault
location, but an obvious shortcoming is that it is hard to scale. Encoding a
long program involves 1) a large number of semantic constraints, 2) many fault
location choices, as well as 3) many intermediate states to be assigned.

Notice that although a program can be arbitrarily long, developers usually
follow the design practice that every function is of limited size. Focusing on
analyzing one function at a time and recursively searching for the final fault
location could be far more efficient than solving one NP-hard problem at the
scale of the entire program.

To facilitate modular analysis of a function, we need to summarize the be- 
havior of its sub-modules (callee functions) and infer external specification from 
its higher-level module (caller function). 

The encoding method introduced above treats one line of code as a constraint
on its pre-state and post-state. To summarize the behavior of a callee function,
we aim to turn it into a similar constraint on the pre-state and post-state of the
calling statement. The inner states of this callee function should be skipped in
the encoding. We compute such summaries of the target function's callees
by symbolic execution: we start with a symbolic representation of the pre-state,
execute the callee function until it returns, and assert that the output state
equals the post-state. In this way, we entirely eliminate all bug location
choices and inner state assignments in the callee function, and greatly
simplify the semantic constraint.
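
As a rough illustration of such a summary, the sketch below symbolically executes a straight-line callee and returns one expression relating its return value to its parameters. The statement format `(target, op, lhs, rhs)`, the function name `symbolic_summary`, and the example callee are assumptions for illustration only; branches, loops, and heap accesses are not modeled.

```python
def symbolic_summary(params, body, ret):
    """Symbolically execute a straight-line body and return the expression
    for `ret` written only in terms of the parameters."""
    env = {p: p for p in params}  # each parameter starts as its own symbol
    for target, op, lhs, rhs in body:
        # Operands that are not known variables are treated as literals.
        left = env.get(lhs, lhs)
        right = env.get(rhs, rhs)
        env[target] = f"({left} {op} {right})"
    return env[ret]

# Hypothetical callee:  t = x + 1;  r = t * 2;  return r
callee_body = [("t", "+", "x", "1"), ("r", "*", "t", "2")]
summary = symbolic_summary(["x"], callee_body, "r")
print(summary)  # ((x + 1) * 2)
```

The intermediate variable `t` no longer appears in the summary, so the caller's encoding sees the call as a single constraint between pre-state and post-state.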

There are two ways to infer the specification of the target function. The first
is to encode only the call stack of the target function up to the top-level
function, where we can use the test case as the specification. All function calls
made by the target's caller and transitive callers that are not on the stack can be
replaced by the automatically computed summaries. We can also disable all fault
location choices except for lines in the target function. The second way is to infer a
possible pre-condition and post-condition of the target function. From the
perspective of the caller, the target function is a line of code that puts an incorrect
constraint on its pre-state and post-state. After the analysis, the constraint solver
infers a feasible pre-state and post-state assuming this incorrect constraint is
removed. This assignment can be used as the pre-condition and post-condition,
which eliminates the need to encode any caller function. Since the second
approach may introduce incompleteness into the analysis, we use it only to
infer a specification for synthesizing the final patch, and use the first one for every
function's analysis.


Domain-specific abstraction. A domain-specific abstraction is essentially a
function summary as discussed above. But for those repeatedly used network
classes (identified by the @network annotation), we can pre-define some more
succinct abstractions based on domain knowledge to make the analysis easier.
The abstraction A[F] of a function F is an over-approximation of F that is
precise enough to characterize the behavior of F.

The abstraction is useful due to two observations. First, source code for
network programs may only be partially available due to the use of high-level
interfaces and native implementations. For example, when comparing
two network addresses for equality, the getClass function is frequently used, but
its implementation depends on the runtime and is not available. To make the
analysis easier, we can instead use the following abstraction for such comparisons:


A[equals] : λx. λy. (x.dtype = y.dtype ∧ x.value = y.value),


where x.dtype denotes the dynamic type of the object x.
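
Read operationally, this abstraction compares two objects by dynamic type and value without calling any runtime-dependent method. A minimal Python sketch follows; the `Obj` record and the type names used in the example are hypothetical and only stand in for network address objects.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    dtype: str  # dynamic type of the object (hypothetical representation)
    value: int  # abstract value of the object

def abstract_equals(x: Obj, y: Obj) -> bool:
    """Abstraction of equals: same dynamic type and same value."""
    return x.dtype == y.dtype and x.value == y.value

a = Obj("IPv4Address", 167772161)
b = Obj("IPv4Address", 167772161)
c = Obj("MacAddress", 167772161)
print(abstract_equals(a, b), abstract_equals(a, c))
```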

Second, network programs have complex operations that are challenging for
symbolic reasoning. For instance, bit manipulations are heavily used in network
data structures. While bit manipulations can improve the performance of network
programs, they present significant challenges for symbolic analysis because they
must be encoded in the theory of bitvectors. We can give an abstraction that preserves
correctness but has simpler behavior, e.g., using the identity function instead
of a hash code computation.


4.3 Patch Synthesis 


The last step of our repair algorithm is to generate a patch to fix the faulty 
program. This corresponds to the SynthesizePatch procedure in Algorithm 1. It 
can be reduced to a sketch finishing problem in program synthesis where we 
replace the existing faulty line with a hole. 

Our general idea is to use plain enumerative search with a depth bound over
the space of candidate patches, but with two significant optimizations.

First, we reduce the search space with heuristics. On the one hand, we only
replace the core expression in the faulty statement with a hole to focus on the most
expressive part. To be specific, we consider changing the right-hand-side
expressions of assignments, conditional expressions of jump statements, return values
of return statements, and functions and arguments of function invocations. On
the other hand, we use a limited grammar to guide the search. We parameterize
all constants, variables, fields, functions, and operators over the sketch and only
instantiate constructs that are in scope. For example, given a particular sketch
with a hole, we only populate the variable set with the local and global variables
that are in scope at the hole. Also, if the hole corresponds to the conditional
expression of an if statement, we only add logical operators to the grammar.
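
A depth-bounded enumerative search over a small in-scope grammar can be sketched as follows. This is an illustrative toy rather than the tool's implementation: the operator set `OPS`, the functions `enumerate_exprs` and `synthesize`, and the representation of the specification as (pre-state, expected hole value) pairs are assumptions.

```python
from itertools import product
import operator

# Hypothetical operator set; a real grammar would depend on the hole's context.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def enumerate_exprs(variables, constants, depth):
    """Yield (text, evaluator) pairs for expressions up to the given depth."""
    leaves = [(v, lambda env, v=v: env[v]) for v in variables]
    leaves += [(str(c), lambda env, c=c: c) for c in constants]
    yield from leaves
    level = leaves
    for _ in range(depth - 1):
        grown = []
        for (lt, lf), (rt, rf) in product(level, leaves):
            for sym, fn in OPS.items():
                grown.append((f"({lt} {sym} {rt})",
                              lambda env, lf=lf, rf=rf, fn=fn: fn(lf(env), rf(env))))
        yield from grown
        level = grown

def synthesize(variables, constants, spec, depth=2):
    """Return the first expression whose value matches every (pre, expected) pair."""
    for text, evaluate in enumerate_exprs(variables, constants, depth):
        if all(evaluate(pre) == expected for pre, expected in spec):
            return text
    return None

# Only variables in scope at the hole (here x and y) enter the grammar.
spec = [({"x": 1, "y": 2}, 3), ({"x": 5, "y": 0}, 5)]
patch = synthesize(["x", "y"], [1], spec)
print(patch)  # (x + y)
```

Restricting `variables`, `constants`, and `OPS` to what is in scope is what keeps the enumeration small; the depth bound guarantees termination.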

Second, we use the local specification to guide the synthesis. Sketch completion
differs from synthesizing a complete program in that the specification
is defined for the entire program, so verifying a candidate patch would require
repeatedly executing the correct part of the program as well. To avoid this cost,
we use the technique described in the modular analysis section to generate a
pre-condition and post-condition for only the faulty line. In this way, only the
generated patch needs to be executed to verify against the specification, which
saves considerable time as the program grows larger.


5 Implementation 


We have implemented the proposed repair technique in a tool called NETREP.
NETREP leverages the Soot static analysis framework [26] to convert Java
programs into Jimple code, which provides a succinct yet expressive set of
instructions for analysis. In addition, NETREP uses the Rosette tool [48] to perform
symbolic reasoning for fault localization and patch synthesis. While our
implementation closely follows the algorithm presented in Section 4, we also apply
several optimizations that are important for the performance of NETREP.

Memories for different types. Since conversion between bitvectors and
integers imposes significant overhead on running time, NETREP divides the
memory into one part for integers and another for bitvectors. In this design, NETREP
automatically selects the memory chunk based on the variable type, and type
checking guarantees that no such conversion occurs.

Stack and heap. To reduce the number of memory operations,
NETREP also divides the memory into stack and heap. As is standard, the stack only
stores static data and its layout is deterministic. Therefore, stacks are
implemented using fixed-size vectors and can be accessed efficiently for read and
write operations. The heap, on the other hand, stores dynamic data that are usually
not known at compile time, such as allocated objects. Since the heap size
cannot be determined beforehand, NETREP uses an uninterpreted function f(x) to
represent the heap, where x is the address and f(x) is the value stored at x.

String values. Since reasoning over string values is a challenging task and not
always necessary for repairing network programs, we simplify the
representation of strings with integer values. Specifically, NETREP maps each string literal
to a unique integer and represents all string operations (e.g., concatenation) with
uninterpreted functions.
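
The string simplification can be pictured with the following Python sketch. The class `StringTable` and its methods are hypothetical names; the uninterpreted concatenation is modeled here as an opaque term, so the only property retained is that equal arguments give equal results.

```python
class StringTable:
    """Maps each distinct string literal to a unique integer (illustrative only)."""

    def __init__(self):
        self._ids = {}

    def intern(self, literal: str) -> int:
        # A new literal receives the next unused integer; a known one keeps its id.
        return self._ids.setdefault(literal, len(self._ids))

    @staticmethod
    def concat(a, b):
        # Stand-in for an uninterpreted function: the result is an opaque term,
        # and no property of real string concatenation is assumed.
        return ("concat", a, b)

table = StringTable()
eth0 = table.intern("eth0")
eth1 = table.intern("eth1")
print(eth0, eth1, table.intern("eth0"))
```

Equality tests between literals reduce to integer comparisons, which is typically enough when strings are only used as identifiers.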

Bounded program analysis. To reduce the repair time, NETREP
only performs bounded program analysis for fault localization and patch
synthesis. Namely, we unroll loops and inline functions up to K times, where K is a
predefined hyper-parameter. In this way, function summaries can be easily and
efficiently computed using symbolic execution.
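
The following Python sketch shows what unrolling a loop up to K times means operationally. The function `unroll` and its `complete` flag are assumptions for illustration; the flag simply records whether the bound was large enough for the loop to exit on its own.

```python
def unroll(cond, body, state, k):
    """Run `while cond(state): state = body(state)` for at most k iterations.

    Returns (final_state, complete); complete is False when the bound
    cut the loop off before its condition became false."""
    for _ in range(k):
        if not cond(state):
            return state, True
        state = body(state)
    return state, not cond(state)

# Hypothetical loop: count up to 3.
print(unroll(lambda s: s < 3, lambda s: s + 1, 0, k=5))  # (3, True)
print(unroll(lambda s: s < 3, lambda s: s + 1, 0, k=2))  # (2, False)
```

When the bound is too small, behavior beyond the K-th iteration is simply not analyzed, which is the source of incompleteness in bounded analysis.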


6 Evaluation 


To evaluate the proposed techniques, we perform experiments that are designed
to answer the following research questions:


RQ1 Is NETREP effective to repair realistic network programs?

RQ2 How efficient are the fault localization and repair techniques in NETREP?

RQ3 How helpful are modular analysis and domain-specific abstraction for
repairing network programs?

RQ4 How does NETREP compare to other repair tools for Java programs?


ID  Module           LOC  # Funcs  # Tests  Succ  Exp  Loc Time (s)  Synth Time (s)  Total Time (s)
 1  DHCP             212       17        2  Yes   Yes            40             117             157
 2  Load Balancer    336       28        2  No    No              -               -               -
 3  Firewall         262       13        2  Yes   Yes           893             197            1090
 4  DHCP             431       32        2  Yes   Yes            95              39             134
 5  Utility          809       65        2  No    No              -               -               -
 6  Routing          605       44        3  Yes   Yes           271             179             450
 7  Utility          454       45        2  Yes   Yes            39              46              85
 8  Learning Switch  738       34        2  Yes   No            571             595            1166
 9  Database         442       17        2  Yes   No            310            2139            2449
10  Link Discovery   671       46        2  Yes   No            268             158             426

Table 1: Experimental results of NETREP.


Benchmark collection. To obtain realistic benchmarks, we crawl the commit 
history of Floodlight [9], a representative open-source SDN controller in Java 
that supports the OpenFlow protocol and a rich set of network functions. To 
distinguish commits caused by bug repairs from those generated for non-repair 
scenarios, we identify commits based on the following criteria: 1) The commit 
message contains keywords about repairing bugs, e.g., “bug”, “error”, “fix”; 2) 
The commit changes no more than three lines of code. 
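
The two selection criteria can be expressed as a simple predicate. The sketch below is only an illustration: the `Commit` record, the function `is_repair_candidate`, and the use of case-insensitive substring matching are assumptions, not a description of the actual crawler.

```python
from dataclasses import dataclass

KEYWORDS = ("bug", "error", "fix")  # example keywords named in the criteria

@dataclass
class Commit:
    message: str
    lines_changed: int

def is_repair_candidate(commit: Commit, max_lines: int = 3) -> bool:
    """Keep commits whose message mentions a repair keyword and that
    change no more than `max_lines` lines of code."""
    message = commit.message.lower()
    # Substring matching is a simplification; it also matches e.g. "prefix".
    mentions_repair = any(keyword in message for keyword in KEYWORDS)
    return mentions_repair and commit.lines_changed <= max_lines

commits = [
    Commit("Fix null check in lease lookup", 2),
    Commit("Refactor routing module", 1),
    Commit("Bug fix for topology update", 10),
]
selected = [c.message for c in commits if is_repair_candidate(c)]
print(selected)
```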

Following these criteria, we have collected 10 commits from the Floodlight
repository and adapted them into our benchmarks. Specifically, given a commit
in the repository, we take the code before the commit as the faulty network 
program and the version after the commit as the ground-truth repaired program. 
The code is post-processed and the parts irrelevant to the bug of interest are 
removed. We also identify corresponding unit tests and modify them to directly 
reveal the bug as appropriate. Each benchmark in our evaluation consists of a 
faulty network program and its corresponding unit tests. 
Experimental setup. All experiments are conducted on a computer with a 4-core
2.80 GHz CPU and 16 GB of physical memory, running the Arch Linux operating
system. We use Racket v7.7 as the compiler and runtime system of NETREP and
set a time limit of 1 hour for each benchmark.


6.1 Main Results 


Our main experimental results are summarized in Table 1. The column labeled 
“Module” describes the network module to which the benchmark belongs. The 
next two columns labeled “LOC” and “# Funcs” show the number of lines 
of source code (in Jimple) and the number of functions, respectively. The “# 
Tests” column presents the number of unit tests used for fault localization and 
patch synthesis. Next, the “Succ” and “Exp” columns show whether NETREP
can successfully repair the program and whether the generated patch is exactly the
same as the ground truth. Since NETREP returns the first fix that passes all
provided test cases, the repaired programs are not necessarily the same as those
expected in the ground truth. In this case, the table shows a “Yes” in the
“Succ” column and a “No” in the “Exp” column. Finally, the last three columns
in Table 1 denote the fault localization time, patch synthesis time, and the total
running time of NETREP.

As shown in Table 1, each benchmark contains between 13 and 65 functions,
and the average number of functions is 34 across all benchmarks. Each
benchmark has 212 to 809 lines of Jimple code, with the average being 496. NETREP
succeeds in repairing 8 out of 10 benchmarks. Furthermore, for 5 of the benchmarks
that can be successfully repaired, NETREP generates exactly the same
fix as the ground truth. Given that our benchmarks cover programs from a variety
of Floodlight modules, such as DHCP Server, Firewall, etc., we believe that
NETREP is effective at repairing realistic network programs (RQ1).

We inspected why NETREP fails to repair benchmarks 2 and 5.
NETREP is not able to localize the fault in benchmark 2 due to its incomplete
support for unbounded data structures with dynamic allocation, such as hash
maps. For benchmark 5, NETREP is able to localize the fault but not able to
synthesize the correct patch. This is because the function expected to be invoked
has side effects involving another function, and verifying them requires
improvements in the specification checking.

Regarding efficiency, NETREP can repair 8 benchmarks in an average of
744 seconds with only 2 to 3 test cases. The fault localization time ranges from
39 seconds to 893 seconds, with 50% of the benchmarks within five minutes. The
patch synthesis time ranges from 39 seconds to 2139 seconds, with 60% of the
benchmarks within five minutes. In summary, the evaluation results show that
NETREP only takes minutes to localize bugs in a faulty program and synthesize
a correct patch based on two to three unit tests (RQ2).
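
The summary statistics above can be recomputed from the per-benchmark values in Table 1. The short Python check below copies the lines-of-code and timing columns from the table; the variable names are chosen for illustration, and "within five minutes" is interpreted as at most 300 seconds.

```python
# Lines of Jimple code for benchmarks 1-10, copied from Table 1.
loc = [212, 336, 262, 431, 809, 605, 454, 738, 442, 671]

# (localization time, synthesis time) in seconds for the 8 repaired benchmarks.
times = {
    1: (40, 117), 3: (893, 197), 4: (95, 39), 6: (271, 179),
    7: (39, 46), 8: (571, 595), 9: (310, 2139), 10: (268, 158),
}

avg_loc = sum(loc) / len(loc)
avg_total = sum(l + s for l, s in times.values()) / len(times)
loc_within_5min = sum(1 for l, _ in times.values() if l <= 300)
synth_within_5min = sum(1 for _, s in times.values() if s <= 300)

print(avg_loc)                             # 496.0
print(avg_total)                           # 744.625
print(loc_within_5min, synth_within_5min)  # 5 6
```

Five and six benchmarks out of the ten collected correspond to the reported 50% and 60%.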


6.2 Ablation Study 


To explore the impact of modular analysis and domain-specific abstraction on
the proposed repair technique, we develop three variants of NETREP:


- NETREP-NOMOD is a variant of NETREP without modular analysis.
Specifically, NETREP-NOMOD inlines the functions in a given program but still
uses abstractions for network data structures for fault localization and patch
synthesis.

- NETREP-NOABS is a variant of NETREP without domain-specific abstraction.
In particular, NETREP-NOABS uses the original concrete implementation of
network functions for symbolic reasoning. If the implementation is written in
a different language, we manually translate the implementation to Java.

- NETREP-NOMODABS is a variant of NETREP without modular analysis or
domain-specific abstraction. NETREP-NOMODABS simply inlines all
functions in the faulty program, including those in the network data structures,
and performs symbolic analysis for fault localization and patch synthesis.


To understand the impact of modular analysis and domain-specific abstraction,
we run all variants on the 10 collected benchmarks. For each variant, we
measure the total running time (including time for fault localization and time
for patch synthesis) on each benchmark, and order the results by running time
in increasing order. The results for all variants are depicted in Figure 4. All lines
stop at the last benchmark that the corresponding variant can solve within the
1-hour time limit.


Fig. 4: Comparing NETREP against three variants.

As shown in Figure 4, both NETREP-NOABS and NETREP-NOMOD can only
solve 4 out of 10 benchmarks in the evaluation, with the average running time
being 569 seconds and 610 seconds, respectively. NETREP-NOMODABS solves
the fewest benchmarks: 3 out of 10. For the ones that it can solve,
the average running time is 1165 seconds. This experiment shows that modular
analysis and domain-specific abstraction both substantially improve NETREP's
efficiency in repairing network programs (RQ3).


6.3 Comparison with the Baseline 


To understand how NETREP performs compared to other Java program repair
tools, we compare NETREP against a state-of-the-art tool called JAID [5] on our 
benchmarks. Specifically, JAID takes as input a faulty Java program, a set of 
unit tests, and a function signature for fault localization and patch synthesis, 
a setting closest to NETREP among a variety of tools. Note that JAID solves a 
simpler repair problem than NETREP, because it requires the user to specify a 
function that is potentially incorrect in the program, whereas NETREP does not 
need input other than the faulty program and unit tests. In order to run JAID 
on our benchmarks, we adjust their formats to fit JAID's and provide the faulty 
function (known from the ground truth) as input for JAID. 

JAID indefinitely enumerates all possible patches, rather than recommending
a single best one. We consider a run successful if the expected patch is
found among the results. In practice, human assistance is needed to pick out this
patch from the thousands of candidates.

JAID is able to finish on 8 out of 10 benchmarks. The expected
patch is found for 2 of them, whereas NETREP gives the expected
result for 5 benchmarks on the first recommendation. For one benchmark, JAID
is unable to produce a fix; for another, it runs out of memory.

We argue that NETREP is better suited for automatically repairing network
programs compared to JAID. First, it only requires network operators to provide
unit test cases. As discussed above, they can be automatically discovered by
another verification or testing procedure. In comparison, JAID requires users to
have the skills to program network controllers in order to identify the buggy
function and pick the correct patch from the results. This is beyond the ability of
most network operators and starts to require an expert team. Second, NETREP has
higher repair accuracy. As discussed above, networks are sensitive to small
mistakes, so high accuracy is crucial for a network to function correctly.

In summary, NETREP is more effective in automatically fixing bugs in net- 
work programs compared to state-of-the-art repairing tools for Java programs, 
especially with respect to repairing accuracy and automation (RQ4). 


7 Related Work 


Automated program repair. Automated program repair is an active re- 
search area that aims to automatically fix the mistakes in programs based on 
specifications of correctness criteria [11,28,39,18], with a variety of applications 
such as aiding software development [34], finding security vulnerabilities [37], 
and teaching novice programmers [49,14]. Different techniques have been pro- 
posed to solve the automated program repair problem, including heuristics-based 
techniques [16,31], semantics-based techniques [37,27], and learning-based tech- 
niques [45,30,32,47]. NETREP is a semantics-based automated repair tool. Dif- 
ferent from prior work, NETREP is specialized to repair network programs based 
on modular analysis and network data structure abstractions. 


Fault localization. Researchers have developed various approaches to fault
localization, including spectrum-based, learning-based, and constraint-based
techniques. Specifically, spectrum-based techniques [27,1,2,7,44,6,19] perform
fault localization by identifying which parts of the program are active during a run
through execution profiles (called the program spectrum). Learning-based techniques
[29,53,54] typically train machine learning models to predict and rank possible
fault locations. By contrast, constraint-based techniques [21,20,12] encode the
semantics of programs as logical constraints and reduce the fault localization
problem to a constraint satisfaction problem. In spirit, NETREP uses a similar
idea for fault localization. However, NETREP performs modular analysis and
enables debugging programs involving object-oriented features, whereas prior
work only analyzes the entire program in a C-like language. Besides, NETREP
reuses the fault localization result to speed up patch synthesis, while prior
work mainly focuses on the fault localization step.

Patch synthesis. Many synthesis algorithms have been developed for generating
patches, including enumerative search [27], constraint-based techniques [37],
statistical models [52], machine learning [15], hints from existing code [25], and
so on. In terms of patch synthesis, NETREP generates a context-free grammar
from the context of fault locations and performs enumerative search based on the
grammar to synthesize patches. It does not require a machine learning model or
statistical information for ranking all possible patches. However, it is conceivable
that NETREP would benefit from the guidance of such ranking techniques.


Verification and synthesis for SDN. In the networking domain, several ver- 
ification tools [3,33,23,24] have been proposed based on either model checking 
or theorem proving. For example, VERICON [3] performs deductive verification 
to verify the correctness of SDN programs specified by network-wide invari- 
ants on all admissible topologies. In addition to verification, synthesis tech- 
niques [36,35,38] have also been proposed to aid software-defined networking. 
NETREP aims to repair network programs automatically, which is a different 
problem than SDN verification or synthesis. 

Repair for network programs. Our work is most related to automated
repair of network programs in the SDN domain [50,51,17]. Prior work on automated
repair [50,51] relies on using Datalog to capture the operational semantics of the
target language to be repaired. These repair techniques work for domain-specific
languages (e.g., Datalog or Ruby on Rails) with simple structure. Similarly,
Hojjat et al. [17] propose a framework based on the Horn clause repair problem to
help network operators fix faulty configurations. By contrast, NETREP targets Java
network programs with object-oriented features and more complex constructs,
which cannot be handled by existing techniques.


8 Limitations and Future Work 


We discuss several limitations of NETREP that we plan to address in future
work. First, NETREP repairs the faulty network program with the first
patch that passes all tests. User interaction that resumes the synthesis could be
introduced in case this patch is not the one intended by the user or required by
a more formal specification.

Second, patches that require complicated changes, e.g., those involving con- 
trol flow structures, are beyond NETREP’s ability. They make up 44% of our col- 
lection of bug-fixing commits. We envision that the challenge can be addressed 
by introducing more sophisticated patch synthesis techniques such as searching 
over a domain-specific language for edits. 

Third, in order to force symbolic execution to terminate in finite time,
NETREP currently unrolls all loops in the network program, which may result in
missing a potential bug. Loop invariant inference techniques can be leveraged to
overcome this challenge and still guarantee termination.


9 Conclusion 


In this paper, we have proposed an automated repair technique for network 
controller programs with unit tests as specifications. Our technique internally 
performs symbolic reasoning for bug localization and patch synthesis, optimized 
by network domain-specific abstractions and modular analysis to reduce encoding
size. We have implemented a tool called NETREP and evaluated it on 10
benchmarks adapted from the Floodlight framework. The experimental results 
demonstrate that NETREP is effective for repairing realistic network programs 
with moderate change sizes. 


Abstract. The 11th edition of the Competition on Software Verification
(SV-COMP 2022) provides the largest ever overview of tools for software 
verification. The competition is an annual comparative evaluation of fully 
automatic software verifiers for C and Java programs. The objective is 
to provide an overview of the state of the art in terms of effectiveness 
and efficiency of software verification, establish standards, provide a 
platform for exchange to developers of such tools, educate PhD students 
on reproducibility approaches and benchmarking, and provide computing 
resources to developers that do not have access to compute clusters. The 
competition consisted of 15648 verification tasks for C programs and 
586 verification tasks for Java programs. Each verification task consisted 
of a program and a property (reachability, memory safety, overflows, 
termination). A new category on data-race detection was introduced as a
demonstration category. SV-COMP 2022 had 47 participating verification
systems from 33 teams from 11 countries. 


1 Introduction


This report is the 2022 edition of the series of competition reports (see footnote)
that accompanies the competition, by explaining the process and rules, giving
insights into some aspects of the competition (this time the focus is on
troubleshooting and reproducing results on a small scale), and, most importantly,
reporting the results of the comparative evaluation. The 11th Competition on
Software Verification (SV-COMP, https://sv-comp.sosy-lab.org/2022) is the largest
comparative evaluation ever in this area. The objectives of the competitions were
discussed earlier (1-4 [16]) and extended over the years (5-6 [17]):


1. provide an overview of the state of the art in software-verification technology
and increase visibility of the most recent software verifiers,

2. establish a repository of software-verification tasks that is publicly available
for free use as standard benchmark suite for evaluating verification software,

3. establish standards that make it possible to compare different verification
tools, including a property language and formats for the results,

4. accelerate the transfer of new verification technology to industrial practice
by identifying the strengths of the various verifiers on a diverse set of tasks,

5. educate PhD students and others on performing reproducible benchmarking,
packaging tools, and running robust and accurate research experiments, and

6. provide research teams that do not have sufficient computing resources with
the opportunity to obtain experimental results on large benchmark sets.


This report extends previous reports on SV-COMP [10, 11, 12, 13, 14, 15, 16, 17, 18].
Reproduction packages are available on Zenodo (see Table 4).


The SV-COMP 2020 report [17] discusses the achievements of the SV-COMP 
competition so far with respect to these objectives. 


Related Competitions. There are many competitions in the area of formal 
methods [9], because it is well-understood that competitions are a fair and 
accurate means to execute a comparative evaluation with involvement of the 
developing teams. We refer to a previous report [17] for a more detailed discussion 
and give here only the references to the most related competitions [20, 53, 67]. 


Quick Summary of Changes. While we try to keep the setup of the compe- 
tition stable, there are always improvements and developments. For the 2022 
edition, the following changes were made: 


• A demonstration category on data-race detection was added. Due to several
participating verification tools, this category will become a normal main
category in SV-COMP 2023. The results are outlined in Sect. 5.

• New verification tasks were added, with an increase in C from 15201 in 2021
to 15648 in 2022 and in Java from 473 in 2021 to 586 in 2022, combined
with ongoing efforts on quality improvement.


2 Organization, Definitions, Formats, and Rules 


Procedure. The overall organization of the competition did not change in 
comparison to the earlier editions [10, 11, 12, 13, 14, 15, 16, 17, 18]. SV-COMP is 
an open competition (also known as comparative evaluation), where all verification 
tasks are known before the submission of the participating verifiers, which is 
necessary due to the complexity of the C language. The procedure is partitioned 
into the benchmark submission phase, the training phase, and the evaluation 
phase. The participants received the results of their verifier continuously via 
e-mail (for preruns and the final competition run), and the results were publicly 
announced on the competition web site after the teams inspected them. 


Competition Jury. Traditionally, the competition jury consists of the chair and 
one member of each participating team; the team-representing members circulate 
every year after the candidate-submission deadline. This committee reviews 
the competition contribution papers and helps the organizer with resolving any
disputes that might occur (from competition report of SV-COMP 2013 [11]). 
In more detail, the tasks of the jury consist of the following: 


• The jury oversees the process and ensures transparency, fairness, and
community involvement.

• Each jury member who participates in the competition is assigned a number
of (3 or 4) submissions (papers and systems) to review.

• Participating systems are reviewed to determine whether they fulfill the
requirements for verifier archives, based on the archives submitted to the
repository.

• Teams and paper submissions are reviewed to verify the requirements for
qualification, based on the submission data and paper in EasyChair and the
results of the qualification runs.

• Some qualified competition candidates are selected to publish (in the LNCS
proceedings of TACAS) a contribution paper that gives an overview of the
participating system.

• The jury helps the organizer with discussing and resolving any disputes that
might occur.

• Jury members adhere to the deadlines with all the duties.


The team representatives of the competition jury are listed in Table 5.


License Requirements. Starting in 2018, SV-COMP has required that each verifier
be publicly available for download and have a license that


(i) allows reproduction and evaluation by anybody (incl. results publication),
(ii) does not restrict the usage of the verifier output (log files, witnesses), and
(iii) allows any kind of (re-)distribution of the unmodified verifier archive.


Two exceptions were made to allow minor incompatibilities for commercial
participants: The jury felt that the rule “allows any kind of (re-)distribution of the
unmodified verifier archive” is too broad. The idea of the rule was to maximize
the possibilities for reproduction. Starting with SV-COMP 2023, this license
requirement shall be changed to “allows (re-)distribution of the unmodified
verifier archive via SV-COMP repositories and archives”.


Validation of Results. The validation of the verification results was done 
by eleven validation tools, which are listed in Table 1, including references to 
literature. Four new validators support the competition: 


• There are two new validators for the C language: DARTAGNAN supports
result validation for violation witnesses in category ConcurrencySafety-Main.
SYMBIOTIC-WITCH supports result validation for violation witnesses in
categories ReachSafety, MemSafety, and NoOverflows.

• For the first time, there are validators for the Java language: GWIT and
WIT4JAVA support result validation for violation witnesses in category
ReachSafety-Java.
Table 1: Tools for witness-based result validation (validators) and witness linter


Validator              Reference     Representative         Affiliation
CPACHECKER             [25, 26, 28]  Thomas Bunk            LMU Munich, Germany
CPA-W2T                [27]          Thomas Lemberger       LMU Munich, Germany
DARTAGNAN (new)        [89]          Hernán Ponce de León   Bundeswehr U., Germany
CPROVER-W2T            [27]          Michael Tautschnig     Queen Mary U. London, UK
GWIT (new)             [68]          Falk Howar             TU Dortmund U., Germany
METAVAL                [33]          Martin Spiessl         LMU Munich, Germany
NITWIT                 [105]         Jana (Philipp) Berger  RWTH Aachen, Germany
SYMBIOTIC-WITCH (new)  [6]           Paulína Ayaziová       Masaryk U., Brno, Czechia
UAUTOMIZER             [25, 26]      Daniel Dietsch         U. of Freiburg, Germany
WIT4JAVA (new)         [108]         Tong Wu                U. of Manchester, UK
WITNESSLINT                          Sven Umbricht          LMU Munich, Germany


Table 2: Scoring schema for SV-COMP 2022 (unchanged from 2021 [18])


Reported result   Points  Description
UNKNOWN                0  Failure to compute verification result
FALSE correct         +1  Violation of property in program was correctly found
                          and a validator confirmed the result based on a witness
FALSE incorrect      −16  Violation reported but property holds (false alarm)
TRUE correct          +2  Program correctly reported to satisfy property
                          and a validator confirmed the result based on a witness
TRUE incorrect       −32  Incorrect program reported as correct (wrong proof)
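
The scoring schema can be written down as a small function. The Python sketch below is an illustration only: the function name `score` and its parameters are hypothetical, and the treatment of a correct but unconfirmed result as 0 points is an assumption based on the scoring visualization (Fig. 2), since Table 2 lists only confirmed results.

```python
def score(property_holds: bool, reported: str, witness_confirmed: bool) -> int:
    """Points for one verification task, following Table 2.

    reported is one of "TRUE", "FALSE", "UNKNOWN"."""
    if reported == "UNKNOWN":
        return 0
    if reported == "TRUE":
        if not property_holds:
            return -32  # wrong proof
        # Assumption: correct but unconfirmed results score 0.
        return 2 if witness_confirmed else 0
    if reported == "FALSE":
        if property_holds:
            return -16  # false alarm
        # Assumption: correct but unconfirmed results score 0.
        return 1 if witness_confirmed else 0
    raise ValueError(f"unexpected result: {reported!r}")

# Hypothetical run over four tasks: (property holds?, reported, witness confirmed?)
runs = [(True, "TRUE", True), (False, "FALSE", True),
        (True, "FALSE", False), (False, "UNKNOWN", False)]
total = sum(score(*run) for run in runs)
print(total)  # 2 + 1 - 16 + 0 = -13
```

The asymmetry between rewards and penalties means that a single wrong answer outweighs many correct ones.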


Task-Definition Format 2.0. SV-COMP 2022 used the task-definition format 
in version 2.0. More details can be found in the report for Test-Comp 2021 [19]. 


Properties. Please see the 2015 competition report [13] for the definition of the 
properties and the property format. All specifications used in SV-COMP 2022 
are available in the directory c/properties/ of the benchmark repository. 


Categories. The (updated) category structure of SV-COMP 2022 is illustrated by
Fig. 1. The categories are also listed in Tables 8, 9, and 10, and described in detail
on the competition web site (https://sv-comp.sosy-lab.org/2022/benchmarks.php).
Compared to the category structure for SV-COMP 2021, we added the
sub-category Termination-BitVectors to category Termination and the sub-category
SoftwareSystems-BusyBox-ReachSafety to category SoftwareSystems.


Scoring Schema and Ranking. The scoring schema of SV-COMP 2022 was the 
same as for SV-COMP 2021. Table 2 provides an overview and Fig. 2 visually illus- 
trates the score assignment for the reachability property as an example. As before, 
the rank of a verifier was decided based on the sum of points (normalized for meta 
categories). In case of a tie, the rank was decided based on success run time, which 
is the total CPU time over all verification tasks for which the verifier reported 
Fig. 1: Category structure for SV-COMP 2022; category C-FalsificationOverall
contains all verification tasks of C-Overall without Termination; Java-Overall
contains all Java verification tasks; compared to SV-COMP 2021, there is one
new sub-category in Termination and one new sub-category in SoftwareSystems
[Figure: TASK → VERIFIER (true / false / unknown) → WITNESS VALIDATOR (witness
confirmed / unconfirmed / invalid); unconfirmed results (opposite answer, unknown, or
resources exhausted) and invalid witnesses (error in witness syntax) score 0]


Fig. 2: Visualization of the scoring schema for the reachability property
(unchanged from 2021 [18])


[Figure: (a) Verification Task, (b) Benchmark Definition, (c) Tool-Info Module,
(d) Verifier Archive, (e) Verification Run, (f) Violation Witness / Correctness Witness]


Fig. 3: Benchmarking components of SV-COMP and competition's execution flow
(same as for SV-COMP 2020)


a correct verification result. Opt-out from Categories and Score Normalization 
for Meta Categories was done as described previously [11] (page 597). 
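
The scoring schema of Table 2 and the tie-breaking rule can be summarized in a minimal sketch. The names `score`, `VerifierResult`, and `rank` are illustrative assumptions and not part of the competition scripts; the sketch assumes that a correct result without a confirmed witness earns 0 points (as Fig. 2 suggests) and it omits the score normalization for meta categories.

```python
from dataclasses import dataclass

# Points from Table 2, keyed by (reported result, result is correct).
POINTS = {
    ("true", True): 2,      # correct proof, witness confirmed
    ("true", False): -32,   # wrong proof
    ("false", True): 1,     # correct alarm, witness confirmed
    ("false", False): -16,  # false alarm
}


def score(reported: str, expected: str, witness_confirmed: bool) -> int:
    """Score one verification run; verdicts are 'true', 'false', or 'unknown'."""
    if reported == "unknown":
        return 0
    correct = reported == expected
    if correct and not witness_confirmed:
        return 0  # assumption: unconfirmed correct results earn no points
    return POINTS[(reported, correct)]


@dataclass
class VerifierResult:
    name: str
    total_score: int
    success_cpu_time: float  # CPU seconds over correctly solved tasks


def rank(results):
    """Higher score first; ties are broken by lower success run time."""
    return sorted(results, key=lambda r: (-r.total_score, r.success_cpu_time))
```

For example, `score("false", "true", True)` yields the false-alarm penalty of -16, and two verifiers with equal scores are ordered by their success run time.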


Reproducibility. SV-COMP results must be reproducible, and consequently, 
all major components are maintained in public version-control repositories. The 
overview of the components is provided in Fig. 3, and the details are given 
in Table 3. We refer to the SV-COMP 2016 report [14] for a description of all 
components of the SV-COMP organization. There are competition artifacts at 
Zenodo (see Table 4) to guarantee their long-term availability and immutability. 


Competition Workflow. The workflow of the competition is described in the re- 
port for Test-Comp 2021 [19] (SV-COMP and Test-Comp use a similar workflow). 
Table 3: Publicly available components for reproducing SV-COMP 2022


Component               Fig. 3   Repository                                          Version

Verification Tasks      (a)      gitlab.com/sosy-lab/benchmarking/sv-benchmarks      svcomp22
Benchmark Definitions   (b)      gitlab.com/sosy-lab/sv-comp/bench-defs              svcomp22
Tool-Info Modules       (c)      github.com/sosy-lab/benchexec                       3.10
Verifier Archives       (d)      gitlab.com/sosy-lab/sv-comp/archives-2022           svcomp22
Benchmarking            (e)      github.com/sosy-lab/benchexec                       3.10
Witness Format          (f)      github.com/sosy-lab/sv-witnesses                    svcomp22


Table 4: Artifacts published for SV-COMP 2022 


Content DOI Reference 


Verification Tasks 10.5281/zenodo.5831003 [22] 
Competition Results 10.5281/zenodo.5831008 [21] 
Verifiers and Validators 10.5281/zenodo.5959149 [24] 
Verification Witnesses  10.5281/zenodo.5838498 [23] 
BenchExec 10.5281/zenodo.5720267 [106] 


3 Reproducing a Verification Run and 
Trouble-Shooting Guide 


In the following, we explain a few steps that are useful for reproducing individual
results and for trouble shooting. The description is written from the perspective of
a participant.


Step 1: Make Verifier Archive Available. The first action item for a partici- 
pant is to submit a merge request to the repository that contains all the verifier 
archives (see list of merge requests at GitLab). Typical problems include: 


• The fork is not public. This means that the continuous integration (CI)
pipeline results are not visible and the merge request cannot be merged.

• The shared runners are not enabled. This means that the CI pipeline cannot
run and no results will be available.

• The verifier does not provide a version string (which should not include the
verifier name itself). This means that it is not possible to later determine
which version of the verifier was used for the experiments. Therefore, version
strings are mandatory and are checked by the CI.

• The interface between the execution (with BENCHEXEC) and the verification
tool can be checked using the procedure described in the BENCHEXEC
documentation.¹
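
The requirement on version strings can be illustrated with a small sketch. The helper `check_version_string` is hypothetical and only mirrors the two conditions stated above (a version string must exist and should not contain the verifier name); it is not the check that the competition's CI actually runs.

```python
def check_version_string(verifier_name: str, version: str) -> list:
    """Return a list of problems found in a verifier's version string."""
    problems = []
    if not version.strip():
        problems.append("version string is empty")
    elif verifier_name.lower() in version.lower():
        problems.append("version string contains the verifier name")
    return problems


# Hypothetical inputs for illustration only.
print(check_version_string("myverifier", "2.1"))             # no problems
print(check_version_string("myverifier", "MyVerifier 2.1"))  # contains the name
print(check_version_string("myverifier", ""))                # empty
```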


Step 2: Ensure That Verifier Works on Competition Machines. Once the 
CI checks passed and the archive is merged into the official competition repository, 
the verifier can be executed on the competition machines on a few verification 


1 https://github.com/sosy-lab/benchexec/blob/3.10/doc/tool-integration.md 
tasks. The competition uses the infrastructure VERIFIERCLOUD, and remote
execution in this compute cloud is possible using COVERITEAM [29]. COVERITEAM
is a tool for constructing cooperative verification tools from existing components,
and the competition has been supported by this project since SV-COMP 2021. Among
its many capabilities, it enables remote execution of verification runs directly
on the competition machines, which was found to be a valuable service for
trouble shooting. A description and example invocation for each participating
verifier is available in the COVERITEAM documentation (see file
doc/competition-help.md in the COVERITEAM repository). Competition participants
are asked to execute their tool locally using COVERITEAM and then remotely on the
competition machines. Typical problems include:


• Verifiers sometimes have insufficient log output, such that it is not possible
to observe what the verifier was executing. The first step towards trouble
shooting is always to ensure some minimal log output.

• The verifier assumes software that is not installed yet. Each verifier states its
dependencies in its documentation. For example, the verifier CBMC specifies
under required-ubuntu-packages that it relies on the Ubuntu packages
gcc and libc6-dev-i386 in file benchmark-defs/category-structure.yml in
the repository with the benchmark definitions. This is easy to fix by adding
the dependency in the definition file and getting it installed.

• The verifier makes assumptions about the hardware of the machine, e.g.,
selecting a specific processing unit. This can be investigated by running the
verifier in the Docker container and remotely on the competition machines.

• For the above-mentioned purpose, the competition offers a Docker image
that can be used to try out if all required dependencies are available.²

• The competition also provides a list of installed packages, which is important
for ensuring reproducibility.
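
A dependency problem of the kind described above can be spotted by comparing the packages a verifier requires with the list of installed packages. The following is a minimal sketch; the function `missing_packages` and the contents of `installed` are assumptions made for illustration, and only the two package names for CBMC are taken from the text.

```python
def missing_packages(required, installed):
    """Return the required packages that are absent from the installed list."""
    return sorted(set(required) - set(installed))


# Packages that CBMC states under required-ubuntu-packages (from the text above).
required = ["gcc", "libc6-dev-i386"]

# Hypothetical excerpt of a list of installed packages.
installed = ["gcc", "python3", "openjdk-11-jre-headless"]

print(missing_packages(required, installed))
```

An empty result means that every stated dependency is covered; anything else points to a package that should be added to the definition file.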


Step 3: Check Prerun Results. So far, we considered executing individ- 
ual verification runs in the Docker container or remotely on the competition 
machines. As a service to the participating teams, the competition offers train- 
ing runs and provides the results to the teams. Typical checks that teams 
perform on the prerun results include: 


• Inspect the verification results (solution to the verification task, like TRUE,
FALSE, UNKNOWN, etc.) and log files.

• Inspect the validation results (was the verification result confirmed by a
validator) and the produced verification witnesses.

• Inspect the result of the witness linter. All witnesses should be syntactically
correct according to the witness specification.

• In case the verification result does not match the expected result, investigate
the verifier and the verification task; in case of problems with the verification
task, report to the jury by creating a merge request with a fix or an issue for
discussion in the SV-Benchmarks repository.
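
The checks listed above can be organized as a simple triage over the prerun results. The sketch below is an illustration under assumed data structures: `PrerunResult` and its fields are hypothetical and do not correspond to the format of the result files that the competition provides.

```python
from dataclasses import dataclass


@dataclass
class PrerunResult:
    task: str                # name of the verification task
    expected: str            # expected verdict: 'true' or 'false'
    reported: str            # verifier verdict: 'true', 'false', or 'unknown'
    witness_confirmed: bool  # did some validator confirm the result?
    witness_lint_ok: bool    # did the witness pass the witness linter?


def triage(results):
    """Group tasks by the kind of follow-up they need."""
    report = {"mismatch": [], "unconfirmed": [], "lint_failure": []}
    for r in results:
        if r.reported not in ("true", "false"):
            continue  # no verdict, hence no witness to inspect
        if r.reported != r.expected:
            report["mismatch"].append(r.task)
        elif not r.witness_confirmed:
            report["unconfirmed"].append(r.task)
        if not r.witness_lint_ok:
            report["lint_failure"].append(r.task)
    return report
```

Tasks under `mismatch` call for investigating both the verifier and the task, while `unconfirmed` and `lint_failure` point to the produced witnesses.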


2 https://gitlab.com/sosy-lab/benchmarking/competition-scripts/-/tree/svcomp22 
Table 5: Competition candidates with tool references and representing jury members;
^new for first-time participants, ∅ for hors-concours participation


Participant          Ref.        Jury member            Affiliation

2LS                  [36,81]     Viktor Malík           BUT, Brno, Czechia
APROVE               [65,100]    Jera Hensel            RWTH Aachen, Germany
BRICK                [37]        Lei Bu                 Nanjing U., China
CBMC                 [75]        Michael Tautschnig     Queen Mary U. of London, UK
COASTAL∅             [102]       (hors concours)        –
CVT-ALGOSEL^new ∅    [29,30]     (hors concours)        –
CVT-PARPORT^new ∅    [29,30]     (hors concours)        –
CPA-BAM-BNB∅         [3,104]     (hors concours)        –
CPA-BAM-SMG^new                  Anton Vasilyev         ISP RAS, Russia
CPACHECKER           [31,49]     Thomas Bunk            LMU Munich, Germany
CPALOCKATOR∅         [4,5]       (hors concours)        –
CRUX^new             [52,96]     Ryan Scott             Galois, USA
CSEQ                 [47,71]     Emerson Sales          Gran Sasso Science Institute, Italy
DARTAGNAN            [58,88]     Hernán Ponce de León   U. Bundeswehr Munich, Germany
DEAGLE^new           [62]        Fei He                 Tsinghua U., China
DIVINE∅              [8,76]      (hors concours)        –
EBF^new                          Fatimah Aljaafari      U. of Manchester, UK
ESBMC-INCR∅          [43,46]     (hors concours)        –
ESBMC-KIND           [56,57]     Rafael Sá Menezes      U. of Manchester, UK
FRAMA-C-SV           [34,48]     Martin Spiessl         LMU Munich, Germany
GAZER-THETA∅         [1,60]      (hors concours)        –
GDART^new            [84]        Falk Howar             TU Dortmund, Germany
GOBLINT              [95,103]    Simmo Saan             U. of Tartu, Estonia
GRAVES-CPA^new       [79]        Will Leeson            U. of Virginia, USA
INFER^new ∅          [38,73]     (hors concours)        –
JAVA-RANGER          [98,99]     Soha Hussein           U. of Minnesota, USA
JAYHORN              [72,97]     Ali Shamakhi           Tehran Inst. Adv. Studies, Iran
JBMC                 [44,45]     Peter Schrammel        U. of Sussex / Diffblue, UK
JDART                [80,83]     Falk Howar             TU Dortmund, Germany
KORN                 [55]        Gidon Ernst            LMU Munich, Germany
LART^new             [77,78]     Henrich Lauko          Masaryk U., Brno, Czechia
LAZY-CSEQ∅           [69,70]     (hors concours)        –
LOCKSMITH^new        [90]        Vesal Vojdani          U. of Tartu, Estonia
PESCO                [93,94]     Cedric Richter         U. of Oldenburg, Germany
PINAKA∅              [41]        (hors concours)        –
PREDATORHP∅          [66,87]     (hors concours)        –
SESL^new                         Xie Li                 Academy of Sciences, China
SMACK∅               [61,92]     (hors concours)        –
SPF∅                 [85,91]     (hors concours)        –
SYMBIOTIC            [39,40]     Marek Chalupa          Masaryk U., Brno, Czechia
THETA^new            [101,109]   Vince Molnár           BME Budapest, Hungary
UAUTOMIZER           [63,64]     Matthias Heizmann      U. of Freiburg, Germany
UGEMCUTTER^new       [74]        Dominik Klumpp         U. of Freiburg, Germany
UKOJAK               [54,86]     Frank Schüssele        U. of Freiburg, Germany
UTAIPAN              [51,59]     Daniel Dietsch         U. of Freiburg, Germany
VERIABS              [2,50]      Priyanka Darke         Tata Consultancy Services, India
VERIFUZZ             [42,82]     Raveendra Kumar M.     Tata Consultancy Services, India
Table 6: Algorithms and techniques that the participating verification systems used;
^new for first-time participants, ∅ for hors-concours participation

[Table body: one row per verifier of Table 5 and one column per algorithm or
technique, with check marks for the techniques used; the column headers and the
column positions of the check marks are not recoverable from the extracted text]


4 Participating Verifiers 


The participating verification systems are listed in Table 5. The table contains
the verifier name (with hyperlink), references to papers that describe the systems,
the representing jury member, and the affiliation. The listing is also available on
the competition web site at https://sv-comp.sosy-lab.org/2022/systems.php. Table 6
lists the algorithms and techniques that are used by the verification tools, and
Table 7 gives an overview of commonly used solver libraries and frameworks.


Hors-Concours Participation. There are verification tools that participated
in the comparative evaluation, but did not participate in the rankings. We call
this kind of participation hors concours as these participants cannot participate
in rankings and cannot “win” the competition. Those are either passive or active
participants. Passive participation means that the tools are taken from previous
years of the competition, in order to show progress and compare new tools against
them (COASTAL∅, CPA-BAM-BNB∅, CPALOCKATOR∅, DIVINE∅, ESBMC-INCR∅,
GAZER-THETA∅, LAZY-CSEQ∅, PINAKA∅, PREDATORHP∅, SMACK∅, SPF∅). Active
participation means that there are teams actively developing the tools, but there
are reasons why those tools should not occur in the rankings. For example, a
tool might use other tools that participate in the competition on their own,
and comparing such a tool in the ranking could be considered unfair
(CVT-ALGOSEL^new ∅, CVT-PARPORT^new ∅). Also, a tool might produce uncertain results
and the team was not sure if the full potential of the tool can be shown in the
SV-COMP experiments (INFER^new ∅). Those participations are marked as ‘hors
concours’ in Table 5 and others, and the names are annotated with a symbol (∅).


Table 7: Solver libraries and frameworks that are used as components in the participating
verification systems (component is mentioned if used more than three times); ^new for
first-time participants, ∅ for hors-concours participation


5 Results and Discussion 


The results of the competition represent the state of the art of what can
be achieved with fully automatic software-verification tools on the given benchmark
set. We report the effectiveness (number of verification tasks that can
be solved and correctness of the results, as accumulated in the score) and
the efficiency (resource consumption in terms of CPU time and CPU energy).
The results are presented in the same way as in previous years, such that the
improvements compared to last year are easy to identify, except that due to the
number of tools, we have to split the table and put the hors-concours verifiers
into a second results table. The results presented in this report were inspected
and approved by the participating teams.


Computing Resources. The resource limits were the same as in the previous
competitions [14]: Each verification run was limited to 8 processing units (cores),
15 GB of memory, and 15 min of CPU time. Witness validation was limited
to 2 processing units, 7 GB of memory, and 1.5 min of CPU time for violation
witnesses and 15 min of CPU time for correctness witnesses. The machines
for running the experiments are part of a compute cluster that consists of
167 machines; each verification run was executed on an otherwise completely
unloaded, dedicated machine, in order to achieve precise measurements. Each
machine had one Intel Xeon E3-1230 v5 CPU, with 8 processing units each,
a frequency of 3.4 GHz, 33 GB of RAM, and a GNU/Linux operating system
(x86_64-linux, Ubuntu 20.04 with Linux kernel 5.4). We used BENCHEXEC [32]
to measure and control computing resources (CPU time, memory, CPU energy)
and VERIFIERCLOUD to distribute, install, run, and clean up verification runs,
and to collect the results. The values for time and energy are accumulated
over all cores of the CPU. To measure the CPU energy, we used CPU ENERGY
METER [35] (integrated in BENCHEXEC [32]).


Table 8: Quantitative overview over all regular results; empty cells are used for opt-outs,
^new for first-time participants

Table 9: Quantitative overview over all hors-concours results; empty cells represent
opt-outs, ^new for first-time participants, ∅ for hors-concours participation


One complete verification execution of the competition consisted of 309 081 ver- 
ification runs (each verifier on each verification task of the selected categories 
according to the opt-outs), consuming 937 days of CPU time and 249 kWh 
of CPU energy (without validation). Witness-based result validation required 
1.43 million validation runs (each validator on each verification task for categories 
with witness validation, and for each verifier), consuming 708 days of CPU time. 
Each tool was executed several times, in order to make sure no installation issues 
occur during the execution. Including preruns, the infrastructure managed a 
Table 10: Overview of the top-three verifiers for each category; ^new for first-time
participants, measurements for CPU time and energy rounded to two significant digits
(‘–’ indicates a missing energy value due to a configuration bug)


Rank Verifier Score CPU CPU Solved Unconf. False Wrong
Time Energy Tasks Tasks Alarms Proofs
(in h) (in kWh)


ReachSafety
1 VERIABS 6923 170 1.8 4117 359
2 CPACHECKER 5572 130 1.5 3245 228 4
3 PESCO 5080 63 0.57 3033 314 7

MemSafety
1 SYMBIOTIC 4051 2.6 0.034 2167 1097
2 CPA-BAM-SMG^new 3101 7.3 0.064 2975 17
3 CPACHECKER 3057 7.8 0.069 3119 0

ConcurrencySafety
1 DEAGLE^new 757 0.50 0.0059 517 42
2 CSEQ 655 5.1 0.059 454 50
3 UGEMCUTTER^new 612 4.9 – 445 21

NoOverflows
1 CPACHECKER 531 1.2 0.012 366 3
2 UAUTOMIZER 506 2.0 0.019 356 2
3 UTAIPAN 501 2.2 0.023 355 1

Termination
1 UAUTOMIZER 2552 13 0.12 1581 8
2 APROVE 2305 38 0.43 1114 37
3 2LS 2178 2.9 0.025 1163 203

SoftwareSystems
1 SYMBIOTIC 2704 1.2 0.016 1188 73
2 CPACHECKER 809 52 0.60 1660 169 1
3 GRAVES-CPA^new 802 19 0.17 1582 95 2 3

FalsificationOverall
1 CPACHECKER 3835 81 0.90 3626 95 5
2 PESCO 3683 45 0.41 3552 110 9
3 SYMBIOTIC 3274 14 0.18 2295 1191 3

Overall
1 SYMBIOTIC 12249 34 0.44 7430 1529 3
2 CPACHECKER 11904 210 2.3 9773 408 14
3 UAUTOMIZER 11802 170 1.7 7948 311 2 2

JavaOverall
1 JDART 714 1.2 0.012 522 0
2 JBMC 700 0.42 0.0039 506 0
3 JAVA-RANGER 670 4.4 0.052 466 0
Fig. 4: Quantile functions for category C-Overall. Each quantile function illustrates
the quantile (x-coordinate) of the scores obtained by correct verification runs 
below a certain run time (y-coordinate). More details were given previously [11]. 
A logarithmic scale is used for the time range from 1s to 1000s, and a linear 
scale is used for the time range between 0s and 1s. 


total of 2.85 million verification runs consuming 19 years of CPU time, and 16.3 
million validation runs consuming 11 years of CPU time. 
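
To put these totals into perspective, the average resource consumption per run can be derived from the numbers reported above for one complete execution. The following is a minimal worked calculation; the variable names are illustrative, and the averages are only as precise as the rounded totals in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60
JOULES_PER_KWH = 3.6e6

# Totals reported for one complete verification execution.
verification_runs = 309_081
verification_cpu_days = 937
verification_energy_kwh = 249

# Totals reported for witness-based result validation.
validation_runs = 1_430_000
validation_cpu_days = 708

avg_verification_seconds = verification_cpu_days * SECONDS_PER_DAY / verification_runs
avg_verification_joules = verification_energy_kwh * JOULES_PER_KWH / verification_runs
avg_validation_seconds = validation_cpu_days * SECONDS_PER_DAY / validation_runs

print(f"CPU time per verification run: {avg_verification_seconds:.0f} s")
print(f"CPU energy per verification run: {avg_verification_joules:.0f} J")
print(f"CPU time per validation run: {avg_validation_seconds:.0f} s")
```

On average, a verification run therefore used roughly 262 s of CPU time, well below the limit of 15 min, and roughly 2.9 kJ of CPU energy; a validation run used about 43 s of CPU time.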


Quantitative Results. Tables 8 and 9 present the quantitative overview of 
all tools and all categories. Due to the large number of tools, we need to split 
the presentation into two tables, one for the verifiers that participate in the 
rankings (Table 8), and one for the hors-concours verifiers (Table 9). The head 
row mentions the category, the maximal score for the category, and the number 
of verification tasks. The tools are listed in alphabetical order; every table 
row lists the scores of one verifier. We indicate the top three candidates by 
formatting their scores in bold face and in larger font size. An empty table cell 
means that the verifier opted out of the respective main category (perhaps
participating in subcategories only, restricting the evaluation to a specific topic).
More information (including interactive tables, quantile plots for every category,
and also the raw data in XML format) is available on the competition web site
(https://sv-comp.sosy-lab.org/2022/results) and in the results artifact (see Table 4).


Table 10 reports the top three verifiers for each category. The run time (column 
‘CPU Time’) and energy (column ‘CPU Energy’) refer to successfully solved 
verification tasks (column ‘Solved Tasks’). We also report the number of tasks for 
which no witness validator was able to confirm the result (column ‘Unconf. Tasks’). 
The columns ‘False Alarms’ and ‘Wrong Proofs’ report the number of verification 
tasks for which the verifier reported wrong results, i.e., reporting a counterexample 
when the property holds (incorrect FALSE) and claiming that the program fulfills 
the property although it actually contains a bug (incorrect TRUE), respectively. 
Table 11: Results of verifiers in demonstration category NoDataRace


Verifier          Score   Correct true   Correct false   Incorrect true   Incorrect false

CSEQ                 39        37              61               0                6
DARTAGNAN          -299        47              23              13                0
GOBLINT             124        62               0               0                0
LOCKSMITH^new        34        17               0               0                0
UAUTOMIZER          120        49              54               1                0
UGEMCUTTER^new      151        57              69               1                0
UKOJAK                0         0               0               0                0
UTAIPAN             139        56              59               1                0


Score-Based Quantile Functions for Quality Assessment. We use score- 
based quantile functions [11,32] because these visualizations make it easier 
to understand the results of the comparative evaluation. The results archive 
(see Table 4) and the web site (https://sv-comp.sosy-lab.org/2022/results) include
such a plot for each (sub-)category. As an example, we show the plot for category 
C-Overall (all verification tasks) in Fig. 4. A total of 13 verifiers participated in 
category C-Overall, for which the quantile plot shows the overall performance over 
all categories (scores for meta categories are normalized [11]). A more detailed 
discussion of score-based quantile plots, including examples of what insights one 
can obtain from the plots, is provided in previous competition reports [11, 14]. 

The winner of the competition, SYMBIOTIC, not only achieves the best cumulative
score (the graph for SYMBIOTIC has the longest width from x = 0 to its right
end), but is also extremely efficient (the area below the graph is very small). Verifiers
whose graphs start with a negative cumulative score produced wrong results.
Several verifiers whose graphs start with a minimal CPU time larger than 3 s
are based on Java and the time is consumed by starting the JVM.
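
The reading of the plot described above can be made concrete with a minimal sketch of how such a graph might be constructed. It assumes, in line with the description in this section, that the graph of a verifier starts at the sum of its negative scores and that each correct result, taken in order of increasing CPU time, extends the graph to the right by its score. The function name and the tuple layout are illustrative assumptions, not the competition's plotting scripts.

```python
def quantile_points(runs):
    """Compute (cumulative score, CPU time) points of a score-based quantile graph.

    `runs` is a list of (score, cpu_time_seconds, is_correct) tuples.
    """
    # Wrong results carry negative scores and shift the start of the graph left.
    offset = sum(score for score, _, correct in runs if not correct)
    # Correct results are ordered by the CPU time they needed.
    correct_runs = sorted((time, score) for score, time, correct in runs if correct)
    points = []
    x = offset
    for time, score in correct_runs:
        x += score
        points.append((x, time))
    return points


runs = [(2, 5.0, True), (1, 1.0, True), (-16, 3.0, False)]
print(quantile_points(runs))  # [(-15, 1.0), (-13, 5.0)]
```

In this reading, the right end of the graph is the total score of the verifier, and a small area below the graph corresponds to low CPU time over the solved tasks.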


Demo Category NoDataRace. SV-COMP 2022 had a new category on
data-race detection, and we report the results in Table 11. The benchmark
set contained a total of 162 verification tasks. The category was defined as
a demonstration category because it was not clear how many verifiers would
participate. Eight verifiers specified the execution for this sub-category in their
benchmark definition³ and participated in this demonstration. A detailed table
was generated by BENCHEXEC's table-generator together with all other results
and is available on the competition web site and in the artifact (see Table 4).

The results are presented as a means to show that such a category is useful;
the results do not represent the full potential of the verifiers, as they were not
fully tuned by their developers but handed in for demonstrating abilities only.
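
The scores in Table 11 follow from the result counts and the scoring schema of Table 2. The following minimal sketch recomputes them for a few rows; the function name `demo_score` is an illustrative assumption, and the counts are taken from Table 11.

```python
def demo_score(correct_true, correct_false, incorrect_true, incorrect_false):
    """Apply the points of Table 2 to the four result counts of Table 11."""
    return (2 * correct_true + 1 * correct_false
            - 32 * incorrect_true - 16 * incorrect_false)


# (correct true, correct false, incorrect true, incorrect false) from Table 11
counts = {
    "CSEQ": (37, 61, 0, 6),
    "DARTAGNAN": (47, 23, 13, 0),
    "GOBLINT": (62, 0, 0, 0),
    "UAUTOMIZER": (49, 54, 1, 0),
}

for verifier, row in counts.items():
    print(verifier, demo_score(*row))
# CSEQ 39, DARTAGNAN -299, GOBLINT 124, UAUTOMIZER 120
```

The example shows how strongly wrong results weigh: the 13 incorrect TRUE results of DARTAGNAN cost 416 points and outweigh its 70 correct results.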


Alternative Rankings. The community suggested reporting a couple of alternative
rankings that honor different aspects of the verification process as a complement
to the official SV-COMP ranking. Table 12 is similar to Table 10, but contains


3 https://gitlab.com/sosy-lab/sv-comp/bench-defs/-/tree/svcomp22/benchmark-defs 
Table 12: Alternative rankings for category Overall; quality is given in score
points (sp), CPU time in hours (h), kilo-watt-hours (kWh), wrong results in
errors (E), rank measures in errors per score point (E/sp), joule per score point
(J/sp), and score points (sp)


Rank   Verifier                  Quality   CPU Time   CPU Energy   Solved   Wrong     Rank
                                                                   Tasks    Results   Measure
                                 (sp)      (h)        (kWh)                 (E)

Correct Verifiers                                                                     (E/sp)
1      GOBLINT                    1951      4.9        0.070        1574       0      0
2      UKOJAK                     5078      66         0.71         3988       1      .00020
3      SYMBIOTIC                 12249      34         0.44         7430       3      .00024
       worst (with pos. score)                                               282      .042

Green Verifiers                                                                       (J/sp)
1      GOBLINT                    1951      4.9        0.070        1574       0      120
2      SYMBIOTIC                 12249      34         0.44         7430       3      130
3      CBMC                       6733      25         0.27         6479     282      140
       worst (with pos. score)                                                        690


the alternative ranking categories Correct and Green Verifiers. Column ‘Quality’
gives the score in score points, column ‘CPU Time’ the CPU usage of successful 
runs in hours, column ‘CPU Energy’ the CPU usage of successful runs in kWh, 
column ‘Solved Tasks’ the number of correct results, column ‘Wrong Results’ 
the sum of false alarms and wrong proofs in number of errors, and column 
‘Rank Measure’ gives the measure to determine the alternative rank. 


Correct Verifiers – Low Failure Rate. The right-most columns of Table 10
report that the verifiers achieve a high degree of correctness (all top three
verifiers in C-Overall have less than 2 ‰ wrong results). The winners of
category Java-Overall produced not a single wrong answer. The first category in
Table 12 uses a failure rate as rank measure:
$\frac{\text{number of incorrect results}}{\max(\text{total score},\,1)}$, the number
of errors per score point (E/sp). We use E as unit for number of incorrect results
and sp as unit for total score. The worst result was 0.023 E/sp in SV-COMP 2021
and is now at 0.042 E/sp. GOBLINT is the best verifier regarding this measure.
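
As a worked example, the failure rate can be recomputed from the columns ‘Wrong Results’ and ‘Quality’ of Table 12. The sketch assumes that the denominator is max(total score, 1), which avoids a division by zero for verifiers without positive score; the function name is an illustrative assumption.

```python
def failure_rate(wrong_results: int, total_score: int) -> float:
    """Errors per score point (E/sp)."""
    return wrong_results / max(total_score, 1)


# (wrong results, total score) from Table 12
print(f"{failure_rate(0, 1951):.5f}")   # GOBLINT
print(f"{failure_rate(1, 5078):.5f}")   # UKOJAK
print(f"{failure_rate(3, 12249):.5f}")  # SYMBIOTIC
print(f"{failure_rate(282, 6733):.3f}") # CBMC
```

The printed values are 0.00000, 0.00020, 0.00024, and 0.042, which matches the rank measures listed in Table 12; the value for CBMC coincides with the worst result among verifiers with positive score.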


Green Verifiers – Low Energy Consumption. Since a large part of the cost of
verification is given by the energy consumption, it might be important to also
consider the energy efficiency. The second category in Table 12 uses the energy
consumption per score point as rank measure:
$\frac{\text{total CPU energy}}{\max(\text{total score},\,1)}$, with the unit J/sp.
The worst result from SV-COMP 2021 was 630 J/sp and is now at 690 J/sp. Also
here, GOBLINT is the best verifier regarding this measure.
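
The energy measure can be approximated from the columns ‘CPU Energy’ and ‘Quality’ of Table 12 after converting kWh to joule. This is a sketch only: the function name is an illustrative assumption, and because the energy values in the table are rounded to two significant digits, the recomputed values only approximate the published rank measures.

```python
JOULES_PER_KWH = 3.6e6


def joules_per_score_point(energy_kwh: float, total_score: int) -> float:
    """CPU energy per score point (J/sp)."""
    return energy_kwh * JOULES_PER_KWH / max(total_score, 1)


# (CPU energy in kWh, total score) from Table 12
symbiotic = joules_per_score_point(0.44, 12249)
cbmc = joules_per_score_point(0.27, 6733)
print(f"SYMBIOTIC: {symbiotic:.0f} J/sp, CBMC: {cbmc:.0f} J/sp")
```

Both values come out close to the 130 J/sp and 140 J/sp listed in Table 12.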


New Verifiers. To acknowledge the verification systems that participate for 
the first or second time in SV-COMP, Table 13 lists the new verifiers (in 
SV-COMP 2021 or SV-COMP 2022). 
Table 13: New verifiers in SV-COMP 2021 and SV-COMP 2022; column ‘Sub-categories’
gives the number of executed categories (including demo category
NoDataRace), ^new for first-time participants, ∅ for hors-concours participation


Verifier             Language   First Year   Sub-categories

CVT-ALGOSEL^new ∅    C          2022         18
CVT-PARPORT^new ∅    C          2022         35
CPA-BAM-SMG^new      C          2022         16
CRUX^new             C          2022         20
DEAGLE^new           C          2022          1
EBF^new              C          2022          1
GRAVES-CPA^new       C          2022         35
INFER^new ∅          C          2022         25
LART^new             C          2022         22
LOCKSMITH^new        C          2022          1
SESL^new             C          2022          6
THETA^new            C          2022         13
UGEMCUTTER^new       C          2022          2
FRAMA-C-SV           C          2021          4
GAZER-THETA∅         C          2021          9
GOBLINT              C          2021         25
KORN                 C          2021          4
GDART^new            Java       2022          1


Table 14: Confirmation rate of verification witnesses during the evaluation in
SV-COMP 2022; ^new for first-time participants, ∅ for hors-concours participation


Result                          TRUE                            FALSE
                     Total   Confirmed       Unconf.   Total   Confirmed       Unconf.

2LS                   2394   2388   99.7%        6      1648   1363   82.7%      285
CBMC                  3837   3493   91.0%      344      3536   2986   84.4%      550
CVT-PARPORT^new ∅     7440   7083   95.2%      357      4754   4332   91.1%      422
CPACHECKER            6006   5701   94.9%      305      4175   4072   97.5%      103
DIVINE∅               1692   1672   98.8%       20      1040    870   83.7%      170
ESBMC-KIND            5542   5483   98.9%       59      3034   2556   84.2%      478
GOBLINT               1657   1574   95.0%       83         0      0
GRAVES-CPA^new        5651   5458   96.6%      193      3723   3576   96.1%      147
PESCO                 6155   5734   93.2%      421      4116   3934   95.6%      182
SYMBIOTIC             4878   4798   98.4%       80      4081   2632   64.5%     1449
UAUTOMIZER            5751   5591   97.2%      160      2508   2357   94.0%      151
UKOJAK                2875   2863   99.6%       12      1144   1125   98.3%       19
UTAIPAN               4567   4513   98.8%       54      1719   1576   91.7%      143


Verifiable Witnesses. Results validation is of primary importance in the 
competition. All SV-COMP verifiers are required to justify the result (TRUE 
or FALSE) by producing a verification witness (except for those categories for 
which no result validator is available). We used ten independently developed 
witness-based result validators and one witness linter (see Table 1). 
Fig. 5: Number of evaluated verifiers for each year (first-time participants on top)


Table 14 shows the confirmed versus unconfirmed results: the first column
lists the verifiers of category C-Overall, the three columns for result TRUE report
the total, confirmed, and unconfirmed number of verification tasks for which the
verifier answered with TRUE, respectively, and the three columns for result FALSE
report the total, confirmed, and unconfirmed number of verification tasks for
which the verifier answered with FALSE, respectively. More information (for all
verifiers) is given in the detailed tables on the competition web site and in the
results artifact; all verification witnesses are also contained in the witnesses
artifact (see Table 4). The verifiers 2LS and UKOJAK are the winners in terms of
confirmed results for expected results TRUE and FALSE, respectively. The overall
interpretation is similar to SV-COMP 2020 and 2021 [17, 18].
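
The percentages in Table 14 are the share of confirmed results among all results of the respective kind. A minimal sketch, with an illustrative function name, recomputes two entries of the table and shows why a verifier that never answers FALSE has no confirmation rate in that column.

```python
def confirmation_rate(confirmed: int, total: int):
    """Share of confirmed results in percent, or None if there were no results."""
    if total == 0:
        return None
    return 100 * confirmed / total


# (confirmed, total) from Table 14
print(round(confirmation_rate(2388, 2394), 1))  # 2LS, result TRUE: 99.7
print(round(confirmation_rate(2632, 4081), 1))  # SYMBIOTIC, result FALSE: 64.5
print(confirmation_rate(0, 0))                  # GOBLINT, result FALSE: None
```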


6 Conclusion 


The 11th edition of the Competition on Software Verification (SV-COMP 2022) 
was the largest ever, with 47 participating verification systems (incl. 14 hors- 
concours and 14 new verifiers) (see Fig. 5 for the participation numbers and 
Table 5 for the details). The number of result validators was increased from 6 in 
2021 to 11 in 2022, to validate the results (Table 1). The number of verification 
tasks was increased to 15 648 in the C category and to 586 in the Java category, 
and a new category on data-race detection was demonstrated. A new section 
in this report (Sect. 3) explains steps to reproduce verification results and to 
investigate problems during execution, and a new table gives an overview of
the usage of common solver libraries and frameworks. The high quality standards 
of the TACAS conference, in particular with respect to the important principles 
of fairness, community support, and transparency are ensured by a competition 
jury in which each participating team had a member. We hope that the broad 
overview of verification tools stimulates their further application by an ever 
growing user community of formal methods. 


Data-Availability Statement. The verification tasks and results of the compe- 
tition are published at Zenodo, as described in Table 4. All components and data 
that are necessary for reproducing the competition are available in public
version-control repositories, as specified in Table 3. For easy access, the results
are also presented online on the competition web site https://sv-comp.sosy-lab.org/2022/results.
Abstract. To (dis)prove termination of C programs, AProVE uses symbolic
execution to transform the program's LLVM code into an integer
transition system, which is then analyzed by several backends. The
transformation steps in AProVE and the tools in the backend only produce
sub-proofs in their domains. Hence, we have now developed new techniques
to automatically combine the essence of these proofs. If non-termination
is proved, then they yield an overall witness, which identifies a
non-terminating path in the original C program.


1 Verification Approach and Software Architecture 


To prove (non-)termination of a C program, AProVE uses the Clang compiler [7] 
to translate it to the intermediate representation of the LLVM framework [15]. 
Then AProVE symbolically executes the LLVM program and uses abstraction to 
obtain a finite symbolic execution graph (SEG) containing all possible program 
runs. We refer to [14,17] for further details on our approach to prove termination. 

To prove non-termination, AProVE runs three approaches in parallel, see Fig. 
1. The first two approaches transform the lassos of the SEG to integer transition 
systems (ITSs), which are then passed to the tools T2 [6] and LoAT [11]. If one 
of the tools returns a proof of non-termination, AProVE uses it to construct a 
non-terminating path through the C program. The path of the first succeed- 
ing approach is returned to the user while all other computations are stopped. 
T2's proof consists of a recurrent set characterizing those variable assignments 
that lead to a non-terminating ITS run. Here, AProVE uses an SMT solver to 
identify a corresponding concrete assignment of the variables in the ITS (which 
correspond to the variables in the (abstract) program states of the SEG). The 
third approach transforms the lassos of the SEG directly to SMT formulas which 
are only satisfiable if there is a non-terminating path, and in this case, we can 
deduce a variable assignment from the model of the formulas returned by the 
solver. While the first and the third approach were already available in AProVE 
before [13], we have now extended them by the generation of non-termination 
witnesses. To this end, the variable assignment obtained from these approaches 
is used by AProVE to step through the corresponding lasso of the SEG in order 
to obtain a concrete execution path which witnesses non-termination. To 
ensure that the generation of the path terminates, AProVE stops as soon as a 
program state of the SEG is visited twice. Thus, this approach only succeeds if 
the first loop on the path whose body is executed several times is already the 
non-terminating loop. However, it does not find non-termination witnesses for 
programs with several loops, where the non-terminating path first leads through 
several iterations of other loops before it ends in a non-terminating loop. 

Fig. 1: AProVE's Workflow for Non-Termination Analysis 


To handle such programs as well, we now developed a novel second approach for 
proving non-termination which uses our tool LoAT in the backend. To understand 
how LoAT finds non-termination proofs, consider the function f in Fig. 2. The 
first loop decrements x as long as x is positive and increments y by the same 
amount. Afterwards, the second loop does not terminate if y is greater than 1. 
Hence, the function f does not terminate if the initial value of the parameter 
x is greater than 1. LoAT can detect such coherences in the corresponding ITS 
(Fig. 3a) generated by AProVE. To this end, LoAT uses different forms of loop 
acceleration: Finite acceleration combines several iterations of a looping rule 
into a new rule. LoAT applies this simplification to the rule $r_1$ representing 
the first loop, resulting in the new rule $r_4$ in Fig. 3b. In the second looping 
rule $r_3$, the guard is invariant w.r.t. the update of the variables in this 
rule. In such a case, LoAT applies non-terminating acceleration, transforming 
$r_3$ to $r_5$. Finally, chaining allows to represent the successive execution of 
two rules. For example, the rule $r_6$ is the result of chaining $r_0$ and $r_4$. 
The exact simplification steps performed by LoAT in this example are shown in 
Fig. 3c. Note that the final rule $r_8$ starts from the initial function symbol 
and directly goes to non-termination. Every variable assignment satisfying the 
respective final guard x > 1 results in a non-terminating run. 


    void f(x,y) { 
      y = 0; 
      while (x > 0) { 
        x = x - 1; 
        y = y + 1; 
      } 
      while (y > 1) 
        y = y; 
    } 

Fig. 2: Example C Function 


    $r_0:\ f(x,y) \to \ell_1(x,0)$ 
    $r_1:\ \ell_1(x,y) \to \ell_1(x-1,\,y+1)\ [x > 0]$ 
    $r_2:\ \ell_1(x,y) \to \ell_2(x,y)\ [x \le 0]$ 
    $r_3:\ \ell_2(x,y) \to \ell_2(x,y)\ [y > 1]$ 

Fig. 3a: Corresponding ITS 


    $r_4:\ \ell_1(x,y) \to \ell_1(0,\,y+x)\ [x > 0]$ 
    $r_5:\ \ell_2(x,y) \to \omega\ [y > 1]$ 
    $r_6:\ f(x,y) \to \ell_1(0,x)\ [x > 0]$ 
    $r_7:\ f(x,y) \to \ell_2(0,x)\ [x > 0]$ 
    $r_8:\ f(x,y) \to \omega\ [x > 1]$ 

Fig. 3b: Simplified Rules 
The simplification tree in Fig. 3c is also the starting point for our new 
technique to generate non-termination witnesses. AProVE constructs this tree 
from LoAT's proof output. Then, by processing the leaves of the simplification 
tree from left to right, a path through the SEG can be derived. To determine 
how often one has to traverse earlier loops on the path to the non-terminating 
loop, AProVE uses an SMT solver to find a concrete variable assignment that 
satisfies the final guard. In our example, the final guard x > 1 is satisfied, 
e.g., by {x = 2, y = 0}. Consequently, the corresponding concrete execution 
path includes two iterations of the first loop before reaching the 
non-terminating second loop. 
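
The effect of acceleration and chaining on the example can be checked with a small simulation. The following Python sketch is only an illustration and is not part of AProVE or LoAT; the function names are hypothetical. It executes the first loop of f step by step, compares the result with the closed form expressed by the accelerated rule r4, and reports how many iterations of the first loop precede the non-terminating loop for an initial value that satisfies the final guard x > 1.

```python
def first_loop(x, y):
    """Execute the first loop of f concretely (rule r1 applied repeatedly)."""
    iterations = 0
    while x > 0:
        x, y = x - 1, y + 1
        iterations += 1
    return x, y, iterations


def accelerated_first_loop(x, y):
    """Closed form of rule r4: for x > 0, the loop ends in (0, y + x)."""
    assert x > 0, "r4 is only applicable if its guard x > 0 holds"
    return 0, y + x


def witness_prefix(x):
    """Number of first-loop iterations before the second loop is entered
    with y > 1, or None if the final guard x > 1 of r8 does not hold."""
    if not x > 1:
        return None
    _, y, iterations = first_loop(x, 0)  # r0 sets y to 0
    assert y > 1  # guard of r3/r5; the second loop never changes y
    return iterations
```

For the assignment {x = 2, y = 0} from the text, witness_prefix(2) yields two iterations of the first loop, after which the guard y > 1 of the second loop holds forever.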

Once the path is constructed, AProVE extracts the LLVM program positions 
from the states, obtaining a non-terminating path through the LLVM program in 
form of a lasso. Using the Clang debug information output, AProVE then matches 
the LLVM lines to the lines in the C program. The resulting C witness can be 
validated by the tools CPAchecker [5] and Ultimate Automizer [12]. 


Fig. 3c: Simplification Tree 


2 Discussion of Strengths and Weaknesses 


In general, AProVE is especially powerful on programs where a precise modeling 
of the values of program variables and memory contents is needed to (dis)prove 
termination. However, on large programs containing many variables which are 
not relevant for termination, tools with CEGAR-based approaches are often 
faster. The reason is that AProVE does not implement any techniques to decide 
which variables are relevant for (non-)termination. 

Furthermore, one of AProVE's most crucial weaknesses when proving 
non-termination in past editions of SV-COMP was producing a meaningful witness. 
Therefore, in the two approaches for proving non-termination in AProVE that 
are based on T2 or on the direct analysis of lassos of the SEG, we added the 
novel techniques presented in the current paper to generate non-termination 
witnesses from the obtained variable assignments. Here, the problem is that 
when computing a concrete execution path, we cannot be sure when to stop the 
computation: Whenever we visit a program position repeatedly, we do not know 
if this position is part of the non-terminating loop of the lasso, or if it is still 
part of the finite path to the non-terminating loop. 

In contrast, in our new approach based on LoAT, the simplification tree al- 
lows us to infer the order in which the loops of the program are traversed and 
this tree also contains the information which loop is the non-terminating one. 
Thus, this approach extends AProVE's power substantially, since it can find 
non-termination witnesses for programs where all non-terminating paths lead 
through several iterations of more than one loop. On the other hand, there are 
also examples where the other two approaches outperform the approach based on 
LoAT, e.g., if T2 finds a non-termination proof and LoAT does not. Our 
observation is that especially for small programs containing only a single loop, 
the other approaches are often faster. This is also confirmed by our results in 
the Termination category of SV-COMP 2022: While in the sub-categories 
MainControlFlow and MainHeap, 83% of the non-termination proofs are found using 
T2 or the direct SMT approach, in Termination-Other, 95% of the non-termination 
proofs result from the LoAT approach. This set consists of especially large programs, 
which often contain more than one loop. 

More information about SV-COMP 2022 including the competition results 
can be found in the competition report [3]. 


3 Setup and Configuration 


AProVE is developed in the “Programming Languages and Verification” group 
headed by J. Giesl at RWTH Aachen University. On the web site [2], AProVE 
can be downloaded or accessed via a web interface. Moreover, [2] also contains a 
list of external tools used by AProVE and a list of present and past contributors. 

In SV-COMP 2022, AProVE only participates in the category “Termination”. 
All files from the submitted archive must be extracted into one folder. AProVE 
is implemented in Java and needs a Java 11 Runtime Environment. Moreover, 
AProVE requires the Clang compiler [7] to translate C to LLVM. To analyze the 
resulting ITSs in the backend, AProVE uses LoAT [11] and T2 [6]. Furthermore, 
it applies the satisfiability checkers Z3 [8], Yices [9], and MiniSAT [10] in parallel 
(our archive contains all these tools). As a dependency of T2, Mono [16] (version 
> 4.0) needs to be installed. Extending the path environment is necessary so that 
AProVE can find these programs. Using the wrapper script aprove.py in the 
BenchExec repository, AProVE can be invoked, e.g., on the benchmarks defined 
in aprove.xml in the SV-COMP repository. The most recent version of AProVE 
with the improved witness generation can be downloaded at [1]. 


Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [3] and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. 
The version of our verifier as used in the competition is archived together with other 
participating tools [4]. 

Abstract. BRICK is a bounded reachability checker for embedded C 
programs. BRICK checks the bounded state space of the program in a 
path-oriented style: it enumerates and checks all the possible paths of 
the program within the bound one by one. To alleviate the 
path explosion problem, BRICK locates and records unsatisfiable core 
path segments during the checking of each path and uses them to prune 
the search space. Furthermore, derivative free optimization based falsi- 
fication and loop induction are introduced to handle complex program 
features like nonlinear path conditions and loops efficiently. 


1 Verification Approach 


Existing bounded software checkers usually encode the bounded state space of 
the program into one constraint solving problem directly. However, in this man- 
ner, when the size of the program or the bound of the checking increases, the 
corresponding constraint solving problem explodes quickly and becomes difficult 
to solve by existing SAT/SMT solvers. 

To solve this problem, BRICK checks the bounded state space of the program in 
a path-oriented style: it enumerates and checks all the possible paths within 
the bound one by one [1,2]. The main merit of the approach is that the size of 
the problem that needs to be solved by the constraint solver is well controlled 
and can be easily handled. The main features of BRICK’s solving are reported 
below: 


1.1 Flexible Path Enumeration 


BRICK enumerates potential paths from the control flow graph (CFG) of the 
given program to the user-defined step bound. Two path enumeration strategies 
are applied in BRICK, each with its own advantages. 

of China (No.62172200, No.61632015). 


© The Author(s) 2022 
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 408-412, 2022. 
https://doi.org/10.1007/978-3-030-99527-0_22 




First, we can simply conduct a classical depth-first search (DFS) to enumerate 
program paths. The benefit of this approach is that, if the DFS stops without 
touching the given bound, we can conclude that the target state is unreachable 
in general, not only in the bounded state space. 

We have also implemented a special method that encodes the jump-to relation 
between different code blocks into a SAT formula and obtains a potential 
path by SAT solving. The benefit is that, if the potential path is found to 
be infeasible by the subsequent path condition solving, the infeasible path 
segment in the path can be located and encoded back into the SAT formula to 
prune all future paths containing this infeasible segment. 


1.2 Infeasible Path Segment Pool Guided State Space Pruning 


BRICK solves paths lazily by encoding the path condition of a potential path 
into a feasibility problem. BRICK asks a constraint solver, namely an SMT 
solver (Z3 [6]), interval analysis (dReal [4]), or derivative-free 
optimization-based solving (Section 1.3), to solve the problem. If the path is 
decided to be infeasible by the solver, BRICK tries to extract the unsatisfiable 
core (UC) of the feasibility problem of this path, and maps the UC constraints 
to an infeasible path segment in the path, which is added to the infeasible path 
segment pool. After that, all paths that contain any segment from this pool are 
reported as unreachable directly in the following path enumeration. 
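
The interplay of bounded path enumeration and the infeasible segment pool can be sketched as follows. This Python sketch is an illustration under simplifying assumptions and is not BRICK's implementation (BRICK is written in C++): it shows only the DFS-based strategy, the CFG is a plain adjacency dictionary, and the hypothetical callback solve stands in for the constraint solver and the UC extraction by returning either None (path feasible) or an infeasible segment as a tuple of nodes.

```python
def contains_segment(path, segment):
    """True if `segment` occurs as a contiguous sub-sequence of `path`."""
    n = len(segment)
    return any(path[i:i + n] == segment for i in range(len(path) - n + 1))


def enumerate_paths(cfg, node, target, bound, prefix=()):
    """Depth-first enumeration of paths to `target` with at most `bound` edges."""
    path = prefix + (node,)
    if node == target:
        yield path
        return
    if len(path) > bound:  # the path already uses `bound` edges
        return
    for succ in cfg.get(node, ()):
        yield from enumerate_paths(cfg, succ, target, bound, path)


def check(cfg, entry, target, bound, solve):
    """Return (feasible path or None, number of solver calls)."""
    pool, solver_calls = [], 0
    for path in enumerate_paths(cfg, entry, target, bound):
        if any(contains_segment(path, seg) for seg in pool):
            continue  # pruned without calling the solver
        solver_calls += 1
        segment = solve(path)
        if segment is None:
            return path, solver_calls
        pool.append(segment)
    return None, solver_calls
```

In a CFG with three paths to the target, two of which share an infeasible segment, the second of these two paths is discarded by the pool lookup alone, so the solver is called only twice.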


1.3 Derivative-free Optimization Based Constraint Falsification 


We can see that constraint solving plays an important role in BRICK. However, 
complex path conditions, like nonlinear constraints, which widely appear in 
programs, are hard to handle efficiently with the existing solvers. In BRICK, a 
classification model-based derivative-free optimization (DFO) approach is used 
to alleviate this difficulty by conducting a sample-feedback-learn style 
DFO solving [8]. 

More specifically, the underlying solver guesses a sample solution for the 
feasibility problem. Then, we evaluate whether the sampled solution satisfies 
the path constraint and, if it does not, calculate the distance between the 
sampled solution and a correct one. This distance is used as the feedback 
metric in the classification-based DFO learning, to guide the solver to 
converge to a value that satisfies the path constraint. In practice, this 
approach works very well in nonlinear problem solving. However, the DFO-based 
approach cannot conclude that the target is unreachable if it fails to find a 
solution. 
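
The sample-feedback loop can be illustrated with a much simpler search than the one used in BRICK. The following Python sketch is an assumption-laden stand-in: it uses a plain random search that mixes global samples with samples around the best point found so far, not the classification-based RACOS method, and the example constraint, the function names, and all parameters are hypothetical. The distance function returns 0.0 exactly when the nonlinear constraint is satisfied.

```python
import random


def violation(point):
    """Distance of `point` from the feasible region of the example
    constraint x*x + y*y <= 4 and x*y >= 1.5 (0.0 means satisfied)."""
    x, y = point
    return max(0.0, x * x + y * y - 4.0) + max(0.0, 1.5 - x * y)


def falsify(distance, low=-3.0, high=3.0, budget=4000, seed=0):
    """Return a point with distance 0.0, or None if the budget is exhausted."""
    rng = random.Random(seed)
    best, best_dist, radius = None, float("inf"), (high - low) / 2
    for _ in range(budget):
        if best is None or rng.random() < 0.5:
            # global sample from the whole box
            cand = (rng.uniform(low, high), rng.uniform(low, high))
        else:
            # local sample around the best point seen so far
            cand = tuple(min(high, max(low, c + rng.uniform(-radius, radius)))
                         for c in best)
        dist = distance(cand)
        if dist == 0.0:
            return cand
        if dist < best_dist:  # feedback: keep the closest sample, shrink the radius
            best, best_dist = cand, dist
            radius = max(radius * 0.9, 1e-3)
    return None
```

A returned point is a concrete solution of the path constraint. A result of None only means that the budget was exhausted, which matches the limitation stated above: the search cannot show that no solution exists.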


1.4 Induction-based Loop Handling 


If the target program contains loops, the number of potential paths may explode. 
To alleviate this problem, we conduct an induction-based proof to try to handle 
the loop before we start to do the BMC. 

First of all, we collect the constraints from the assertions and generate the 
weakest precondition respectively. Then, we conduct the normal induction-based 
proof to see whether such constraints are satisfied in any iteration. If no coun- 
terexamples are returned, we know that the assertions won't be violated in the 
loop. Furthermore, we are also working on the integration of loop invariant gen- 
eration to further refine the CFG under checking. 


2 Software Architecture 


The architecture of BRICK is shown in Fig. 1. It consists of a loop processing 
module, a path enumerating module, and a constraint solving module, all 
implemented in C++. 

In the loop processing module, if the program contains an assertion-related 
loop, BRICK first conducts loop induction-based verification. If the induction 
works, BRICK reports unreachable; otherwise, it builds the program CFG and 
performs the following path enumeration based checking. 

In the path enumerating module, BRICK employs SAT-based and DFS-based 
path enumerating methods to extract the program path and its corresponding 
path condition. The constraint solving module accepts the path condition and 
performs constraint solving accordingly. All the techniques used are described 
in Section 1. The solvers used in BRICK include the SAT solver MiniSAT [3], the 
SMT solver Z3 [6], the interval analysis solver dReal [4], and our 
implementation of the DFO method RACOS [9]. 


[Figure: block diagram with the components Induction-based Proving, Program CFG 
Building, SAT-based Path Enumerating, DFS-based Path Enumerating, and MiniSat] 

Fig. 1. Architecture of BRICK 


3 Strengths and Weaknesses 


Most bounded reachability checkers, e.g., CBMC [5], encode the bounded 
state space into a huge SMT formula consisting of both conjunctions and 
disjunctions of different kinds of formulas, which is difficult for existing 
solvers to handle and may easily cause memory explosion. Instead, BRICK 
conducts the verification in a path-oriented way: 


— BRICK enumerates and checks all the potential paths one by one. In this 
manner, the computation complexity is well controlled. 

— Meanwhile, as only the ongoing path is saved in the memory and the cor- 
responding path constraints of the path will be disjunction-free, the solving 
problem is much easier to handle. 

— For the sake of processing capability, UC-guided backtracking and path prun- 
ing is proposed to prune the search space substantially, and DFO-based solv- 
ing is conducted to handle complex nonlinear constraints efficiently. 


BRICK participated in the ReachSafety/Floats category of SV-COMP 2022 
[10]. BRICK successfully verified 439 of the 469 tasks and ranked 1st in this 
sub-category. Furthermore, for these 439 solved cases, BRICK used only 
1000 seconds in total. In contrast, CoveriTeam and VeriAbs [7], which took 
2nd and 3rd place in this category, spent 9300 and 18000 seconds, 
respectively, which is about 9 and 18 times as much as BRICK. 

As for weaknesses, like all other bounded checkers, BRICK may not be able to 
prove the correctness of a program if it cannot finish the search within the 
given step bound. In this case, BRICK can only report bounded true. For 
example, on the tasks of SV-COMP 2022, besides the 439 cases which are proved 
by BRICK, there are also several programs for which BRICK can only give a 
bounded result or runs into a timeout. Therefore, as future work, we are 
implementing techniques including loop summary, k-induction, and so on to 
abstract the loops and give a proof of correctness in certain cases. 


4 Tool Setup and Configuration 


The binary file of BRICK for Ubuntu 20.04 is available at 
https://github.com/brick-tool-dev/BRICK-2.0. To install the tool, please clone 
this repository and follow the instructions in README.md. A tailored version of 
BRICK took part in the ReachSafety/Floats category in SV-COMP 2022 [10]. The 
version [11] supports the checking of reachability of Error Function. The 
BenchExec wrapper script for the tool is brick.py and brick.xml is the benchmark 
description file. 


5 Software Project and Contributors 


BRICK is available under the MIT License. The team of BRICK is from the 
Software Engineering Group, Nanjing University. We would like to thank Sicun 
Gao for his kind help with the usage of dReal. 


Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [10] and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. 
The version of our verifier as used in the competition is archived together with other 
participating tools [11]. 

Abstract. We sketch a sequentialization-based technique for bounded 
detection of data races under sequential consistency, and summarise the 
major improvements to our verification framework over the last years. 


Keywords: Bounded model checking - Context-bounded analysis - Se- 
quentialization - Data races - Reachability - Concurrency - Threads 


1 Verification Approach 


Our approach is based on lazy sequentialization [7]. The idea is to convert the 
concurrent program $P$ of interest into a non-deterministic sequential program 
$Q_{u,k}$ that preserves all feasible executions of $P$ up to unwinding bound 
$u$ and $k$ rounds (or execution contexts [8]). Among different techniques [6], 
we choose bounded model checking [3] to analyse $Q_{u,k}$. In this section, we 
briefly overview lazy sequentialisation, and sketch a novel extension to detect 
data races. Further elements of novelty w.r.t. engineering of our tool are 
discussed in the next section. 


Lazy Sequentialization. We unwind all loops and inline all functions in P, 
except the main function and those from which a thread is spawned, obtaining a 
bounded program $P_u$ that preserves all feasible executions of $P$ up to the 
unwinding bound $u$. We then transform each function of $P_u$ into a thread simulation 
function where each visible statement is assigned a numerical label and a guard, 
and each call to a concurrency-specific function is replaced by a call to a function 
that models the same intended semantics; for each simulation function, we add 
a global variable to represent the program counter, initially set to zero. 

A thread's execution context of $P_u$ is simulated by invoking the corresponding 
thread simulation function of $Q_{u,k}$ that executes from the first statement 
to a non-deterministically selected label, updates the program counter, and 
returns. Further execution contexts are simulated by re-invoking the simulation 
function, where the guards ensure that the control is repositioned to the 
correct numerical label via a sequence of jumps, and so on. To retain 
consistency of the local state of the thread across different invocations of 
the simulation functions, static storage is enforced for all local variables. 
We drive the overall simulation of $P_u$ from the main function of $Q_{u,k}$, 
by invoking the thread simulation functions appropriately. 


* This work has been partially funded by MIUR project PRIN 2017FTXR7S 
IT-MATTERS and MUR project FISR2020IP.05310 MVM-Adapt. 


Data Race Detection. A program contains a data race if it can execute two 
conflicting actions (i.e., one thread modifies a memory location and another 
one reads or modifies the same location), at least one of which is not atomic, 
and neither happens before the other [9]. Consider two threads performing the 
operation v = v+1 on a shared variable initialised to zero. Both threads try 
to modify the data at the memory location reserved for v, but the necessary 
sequences of memory accesses are not synchronised, and thus may interleave. If 
a context-switch happens between the memory read and write operations in the 
thread that runs first, both threads will read 0, and at the end of the execution 
the value of v will be 1. To detect such a situation, we alter the encoding from 
$P_u$ to $Q_{u,k}$ by (i) adding a shared array w_addrs that stores a pointer to 
the memory location targeted by a write operation for each thread, (ii) 
injecting additional control code at each visible statement, and (iii) splitting 
the modified sequentialised encoding of the visible statement into two separate 
sequentialised statements to allow in-between context switching. The code 
fragment shows the modified sequentialised encoding (no guards for simplicity, 
injected code greyed out) for the statement v = v+1 of the first thread of the 
program described above. 

    k:   void *w_addr = &v; 
         assert(w_addrs[1] != w_addr); 
         w_addrs[0] = w_addr; 
         v = v+1; 
    k+1: w_addrs[0] = 0; 

We store in w_addr the address of the variable being written, and then assert 
that the other thread is not writing to the same location; in the same 
(simulated) execution context, we store w_addr in w_addrs, so that the 
assertion can be checked within the other thread too. We reset w_addrs right 
after the statement under consideration. Note the label k+1 that allows thread 
pre-emption. Now, one of the threads can execute the simulated statement at 
label k and context-switch at label k+1 while w_addrs still points to v; this 
makes it possible to schedule the other thread, and fail the assertion in there. 

In the general case, handling multiple memory write accesses for a single 
statement requires a slightly different tracking mechanism for write addresses, 
or decomposition into simpler statements. Statements with read-only shared 
memory access are handled without updating w_addrs. Programs with more 
than two threads require multiple assertions. 
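
The tracking mechanism can be illustrated by a small simulation for two threads that each perform a single write. The following Python sketch is not CSeq's encoding and the function names are hypothetical; variable names stand in for memory addresses. Each thread has two steps: at label k it checks the other thread's entry in w_addrs, records its own write address, and performs the write; at label k+1 it resets its entry. A race is reported if some interleaving lets a thread reach label k while the other thread's entry still holds the same address.

```python
def interleavings(a, b):
    """Yield all merges of the step lists a and b that keep each list's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def has_race(targets):
    """targets[t] names the variable written by thread t (two threads only)."""
    assert len(targets) == 2
    steps = [[(tid, "k"), (tid, "k+1")] for tid in range(2)]
    for schedule in interleavings(steps[0], steps[1]):
        w_addrs = [None, None]
        for tid, label in schedule:
            if label == "k":
                w_addr = targets[tid]
                if w_addrs[1 - tid] == w_addr:
                    return True  # the injected assertion fails
                w_addrs[tid] = w_addr  # the write itself happens here
            else:
                w_addrs[tid] = None  # reset after the statement
    return False
```

With both threads writing v, the schedule in which the second thread runs between labels k and k+1 of the first thread triggers the assertion; with distinct variables no schedule does.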


2 Software Architecture 


CSeq is a framework for quick development of static analysis and program 
transformation prototypes. For parsing the input program CSeq relies on 
pycparserext (pypi.org/project/pycparserext), an extension of pycparser 
(github.com/eliben/pycparser), which in turn is built on top of PLY 
(www.dabeaz.com/ply), a Python implementation of Lex and Yacc. All the mentioned 
components as well as CSeq are entirely written in Python. 

We combined several groups of modules in CSeq, namely (i) program 
simplification, (ii) program unfolding, (iii) sequentialization, (iv) 
instrumentation, and (v) backend invocation and counterexample generation. For 
the analysis of the sequentialised program we rely on CBMC 
(www.cprover.org/cbmc), which in turn embeds the DPLL-style MiniSat SAT solver 
(minisat.se). 

CSeq 3.0 incorporates a significant number of enhancements. At an architec- 
tural level, the main element of novelty is in the modularity between the general- 
purpose functionalities of the framework and the specific lazy sequentialization, 
which opens up to the possibility of prototyping different static analysers for 
other applications (e.g., [11,10]) as well as improving older 
sequentialization-based prototypes (e.g., [4,7,12,13] and variations thereof). The enhancements to 
the framework include: Python 3 support, support for GNU C compiler exten- 
sions, a fully re-implemented symbol table, revised general-purpose modules such 
as constant propagation, function inlining, and loop unrolling, and a custom- 
built version of CBMC (not used in the competition) for SAT-solving under 
assumptions. For the competition we include (experimental) enhanced constant 
propagation, and simplified function inlining. Besides the data race checking 
extension, the sequentialization modules include improvements from earlier im- 
plementations and for different editions of SV-COMP up to date, in 
particular: extended pthread API support (conditional waiting, barriers, and 
thread-specific data management), context-bounded analysis, and a major code 
overhaul. 


3 Strengths and Weaknesses 


The table below summarises the performance of our tool on the 764 cases of the 
Concurrency category and the 162 cases of the data race demo category. 


                               Concurrency   Data race 
    Overall instances                  764         162 
    Correct    safe            [illegible] [illegible] 
               unsafe          [illegible] [illegible] 
    Unknown    reject                    9          19 
               internal error           18          17 
               out of time             159          20 
               out of memory            56           2 
    Incorrect                            7           7 


Our technique excels at hunting bugs, as shown by the number of correct unsafe 
(incl. 17 malformed witnesses and 50 unconfirmed witnesses), but gets quickly 
expensive with larger bounds, hitting the resource limits. The additional 
context-switch points and the use of pointers for data race detection introduce 
further overhead. The other failures are due to limiting assumptions or 
glitches in the implementation. All the false positives are due to corner cases 
in the encoding. 


4 Setup and Configuration 


We competed in the ConcurrencySafety category and in the data race detection 
demo category. CSeq 3.0 is available at 

Installation instructions are in the README file within the package. A wrapper 
script (lazy-cseq.py) invokes CSeq up to three times, with the options 
-l lazy for lazy sequentialisation, --sv-comp to enable the required violation 
witnesses format, --atomic-parameters to assume atomic passing of function 
arguments, --nondet-condvar-wakeups for non-deterministic spurious conditional 
variables wake-up calls, --deep-propagation for experimental constant 
folding and propagation, --32 for 32-bit architectures, --threads 100 to limit 
the overall number of threads, --data-race-check when required, and --backend 
cbmc to use CBMC 5.4 for sequential analysis. 

For reachability checking, on different invocations the script adds different pa- 
rameters: -r2 -w2 -f2, -r4 -w3 -f5, and -r20 -w1 -f11, where r is the number 
of rounds, and f and w are the unwind bounds for for (i.e., potentially bounded) 
and while (i.e., potentially unbounded) loops, respectively; on the last invoca- 
tion --softunwindbound and --unwind-for-max 10000 are also added to fully 
unfold for loops if a static bound can be found, up to the given hard bound. For 
data race detection, the above parameters are replaced with -c4 -u2, -c10 -u10, 
and -c50 -w20 -f20 with --unwind-for-max 100. Note that in this case the 
bound is on the number of execution contexts rather than rounds (-c vs. -r), 
and -u is used as a shorthand for -f and -w. 

We leave the analysis running to completion every time. When the result is 
TRUE, the script restarts the analysis with the next set of parameters. As soon 
as the script gets FALSE, it returns FALSE. Only if the analysis using the last set 
of parameters is finished and the result is TRUE, then the script returns TRUE. 
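
The staged strategy can be summarised by the following Python sketch. It is not the code of lazy-cseq.py: the function staged_verdict and the callback run_analysis are hypothetical, the callback stands in for one complete CSeq invocation with the given extra parameters, and passing any verdict other than TRUE through unchanged is an assumption. The parameter sets are the ones listed above for reachability checking.

```python
REACHABILITY_STAGES = [
    ["-r2", "-w2", "-f2"],
    ["-r4", "-w3", "-f5"],
    ["-r20", "-w1", "-f11", "--softunwindbound", "--unwind-for-max", "10000"],
]


def staged_verdict(run_analysis, stages):
    """Run the stages in order; only a TRUE verdict leads to the next stage."""
    verdict = "UNKNOWN"
    for params in stages:
        verdict = run_analysis(params)  # runs to completion every time
        if verdict != "TRUE":
            return verdict  # e.g. FALSE is returned immediately
    return verdict  # TRUE only if the last stage also reported TRUE
```

The design trades time for precision: small bounds find shallow bugs quickly, while a TRUE answer is only trusted after the largest bounds have been analysed as well.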


Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [1] and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. 
The version of our verifier as used in the competition is archived together with other 
participating tools [2]. 

Abstract. The validation of violation witnesses is an important step 
during software verification. It hides false alarms raised by verifiers from 
engineers, which in turn helps them concentrate on critical issues and 
improves the verification experience. Until the 2021 edition of the Com- 
petition on Software Verification (SV-COMP), CPACHECKER was the 
only witness validator for the ConcurrencySafety category. This article 
describes how we extended the DARTAGNAN verifier to support the valida- 
tion of violation witnesses. The results of the 2022 edition of the competi- 
tion show that, for witnesses generated by different verifiers, DARTAGNAN 
succeeds in the validation of witnesses where CPACHECKER does not. 
Our extension thus improves the validation possibilities for the overall 
competition. We discuss DARTAGNAN’s strengths and weaknesses as a 
validation tool and describe possible ways to improve it in the future. 


1 Introduction 


Most software verification tools report witnesses to property violations. Since 
SV-COMP 2015, there is a common format in which witnesses are represented 
by automata [4]. Each edge of such an automaton is annotated with data that 
can be used to match program executions. A data annotation can be, e.g., “as- 
sumption” specifying constraints on values of variables in a given state, “control” 
specifying the outcome of a branch condition, or “startline” specifying a concrete 
line in the source code. More details about data annotations and their semantics 
can be found in the exchange format documentation [1]. 

A witness validator checks that a violation can be reproduced using the 
information provided by the witness. Automata-based verifiers can easily be 
converted into validators by analyzing the synchronized product of the program 
with the witness automaton. In this setting, the witness automaton guides the 
verifier. If none of the outgoing edges on the program state match the next 
edge of the witness automaton, then the verifier cannot explore the current path 
further. If the edge on the program state matches, then the witness automaton 
and the program proceed to the next state, eventually leading to a violation. 
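
The guidance described above can be sketched as a search over the synchronized product. The sketch below is only illustrative and is not taken from any validator: the program is a hypothetical labeled transition system given as a dictionary, the witness is a single path of edge labels, `validate` is a name introduced here, and matching is assumed to be strict (every program step must match the next witness edge).

```python
from collections import deque

def validate(program, initial, error_states, witness_labels):
    """Search the synchronized product of a program and a single-path witness.

    program: dict mapping a state to a list of (label, successor) pairs
             (hypothetical representation).
    witness_labels: labels of the witness edges, in path order.
    Returns True if a program path matching the witness reaches an error state.
    """
    queue = deque([(initial, 0)])      # (program state, position in the witness)
    seen = {(initial, 0)}
    while queue:
        state, pos = queue.popleft()
        if state in error_states:
            return True                # the violation was reproduced
        if pos == len(witness_labels):
            continue                   # witness exhausted: this path cannot go on
        for label, succ in program.get(state, []):
            # Only program edges that match the next witness edge are followed.
            if label == witness_labels[pos] and (succ, pos + 1) not in seen:
                seen.add((succ, pos + 1))
                queue.append((succ, pos + 1))
    return False

program = {
    "s0": [("a", "s1"), ("b", "s2")],
    "s1": [("c", "err")],
    "s2": [("c", "ok")],
}
print(validate(program, "s0", {"err"}, ["a", "c"]))
print(validate(program, "s0", {"err"}, ["b", "c"]))
```

The witness prunes the search: the second call never visits `s1`, because no edge leaving `s0` with label `a` matches the witness.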


* Jury member. 
© The Author(s) 2022 
While this idea allows one to easily convert any automata-based verifier into a
validator, not all verifiers are automata-based. 

DARTAGNAN is an SMT-based verifier. In the next section, we explain how to
convert it into a validator. The idea is to extract information from the witness 
and use it to reduce the search space explored by the backend SMT solver. 


2 Validation Approach 


Given a concurrent program and a specification in the form of assertions, DARTAGNAN
generates an SMT formula $\varphi_{ver} = \varphi_{CF} \wedge \varphi_{DF} \wedge \varphi_{SC} \wedge \varphi_{A}$, which is satisfiable
if and only if some assertion fails [17,16]. The formulas $\varphi_{CF}$ and $\varphi_{DF}$ encode
(respectively) the control flow and the data flow of the program. Formula $\varphi_{SC}$
encodes scheduling constraints. Finally, $\varphi_{A}$ expresses that at least one assertion
must fail. If the formula is satisfiable, then a violation exists. The goal of
DARTAGNAN (as a verifier) is to find such a violation. This amounts to finding
an appropriate scheduling among the threads. Such a scheduling is encoded as
a happens-before relation between the instructions. DARTAGNAN thus searches
the space of all viable happens-before relations to find a violation or prove that
none exists.

We now explain how to extend DARTAGNAN into a violation witness validator.
The idea is to extract from the violation witness a formula $\varphi_{w}$ that we conjoin
to the rest of DARTAGNAN's encoding, resulting in $\varphi_{val} = \varphi_{ver} \wedge \varphi_{w}$. The
extra constraints in $\varphi_{w}$ reduce the search space for the SMT solver. For the
verification of concurrent programs taking inputs from the environment, there
are two sources of non-determinism: the data coming from the input (which
might influence the control flow) and the scheduling. The purpose of $\varphi_{w}$ is to
reduce this non-determinism. Extending the SMT encoding as described in $\varphi_{val}$
is conceptually easy. The interesting question is “what information from the
witness shall we use?” The less information we use, the more we move from
pure validation to full verification.

While automata-based validators can use some information in a straight- 
forward manner, this is not the case for DARTAGNAN. 


1. A violation witness can contain cycles to represent infinitely many execu- 
tions. However, SMT-based tools unroll cycles and perform bounded verifi- 
cation, thus only part of this information is helpful. 

2. Since DARTAGNAN (like many other BMC tools) does not keep an explicit
notion of state, using state information is not trivial. 


The exchange format for violation witnesses allows for expressing information
about state assumptions, the control flow, and the scheduling. We abstract
away from the first two and only use scheduling information. We assume that
witness automata represent a single path and that the edges contain “startline”
data corresponding to read or write instructions¹. Those are the only instructions


1 Our validator accepts witnesses that do not satisfy the second assumption, but it 
filters out the corresponding edges. 
that can affect our happens-before relation. While we do not explicitly encode
the outcome of control-flow instructions, certain control-flow information is
implicitly encoded based on which instructions are executed. In Section 3, we
explain the reasons behind these design decisions and assumptions, discuss their
limitations, and describe how we plan to improve them in the future. Despite these
limitations, and as we show in Section 4, our validator performs well in practice.
Let (S, E) be a witness automaton with states S and edges E. For each
e ∈ E, the function e2i(e) returns the set of read or write instructions coming from
the “startline” in the C file that corresponds to the given edge. Since witnesses
represent single paths, they can be seen as words over S. Let w ∈ S* be a
witness. We define the witness-to-formula function w2f, which constructs $\varphi_{w}$, as

\[
\mathit{w2f}(w) =
\begin{cases}
\mathit{true} & \text{if } w = \varepsilon \\[4pt]
\mathit{w2f}(w') \wedge \displaystyle\bigvee_{\substack{i_1 \in \mathit{e2i}((\_, s)) \\ i_2 \in \mathit{e2i}((s, \_))}} \text{happens-before}(i_1, i_2) & \text{if } w = s \cdot w'
\end{cases}
\]
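
The recursive definition can be unfolded into a loop over consecutive edges of the path. The following is a minimal sketch, not DARTAGNAN's implementation. It assumes a hypothetical representation in which the witness is given as the list of instruction sets e2i(e) of its edges in path order, it drops edges without read or write instructions (as footnote 1 describes), and it returns the formula as a list of clauses instead of an SMT term. The names `w2f` and `render` are used only for this illustration.

```python
from itertools import product

def w2f(edge_instructions):
    """Build the witness constraints for a single-path witness.

    edge_instructions[k] is the set e2i(e_k) of read/write instructions of
    the k-th edge on the path (hypothetical representation).
    Returns a list of clauses; a clause is a list of (i1, i2) pairs and stands
    for the disjunction of happens-before(i1, i2) over those pairs.
    The conjunction of all clauses is the witness formula.
    """
    # Edges without read/write instructions carry no scheduling information.
    edges = [e for e in edge_instructions if e]
    clauses = []
    # Each inner state s has an incoming edge (_, s) and an outgoing edge (s, _).
    for incoming, outgoing in zip(edges, edges[1:]):
        clauses.append(list(product(sorted(incoming), sorted(outgoing))))
    return clauses

def render(clauses):
    """Pretty-print the clause list; the empty conjunction is 'true'."""
    if not clauses:
        return "true"
    return " & ".join(
        "(" + " | ".join(f"hb({a},{b})" for a, b in clause) + ")"
        for clause in clauses
    )

print(render(w2f([{"r1"}, {"w1", "w2"}, set(), {"r2"}])))
```

Each clause corresponds to one state of the path, so the number of added constraints grows linearly with the length of the witness.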


3 Strengths and Weaknesses 


The main strengths of our validation approach are simplicity and modularity. 
The approach only requires adding a new sub-formula to the SMT encoding
used for verification. The validator is modular in the sense that using more or 
different information from the witness does not change the validation approach. 
For example, adding information from the witness about the control flow just 
requires adding more constraints to $\varphi_{w}$.

Our validation approach assumes that witness automata represent single 
paths. This is a limitation not imposed by the exchange format. However, veri- 
fiers tend to stop as soon as they find one violation and thus generate witnesses 
representing a single violation path. A second limitation is that we do not ex- 
plicitly consider control-flow information. This might impact the performance 
of the validation since not all non-determinism is removed and the search space 
might still be large. Converting such control-flow information into SMT is simple 
in principle. However, since DARTAGNAN internally converts the C program into 
Boogie [15], matching conditionals with the corresponding assembly-like jumps
requires some work. A second consequence of not extracting control-flow infor- 
mation from the witness is that we might validate witnesses that do not lead 
to a violation. This is because we over-approximate the paths of the program 
represented by the witness and thus our approximation might include the path 
leading to the violation even if the witness did not. 


4 Validation Results 


We inspected the results of SV-COMP 2022 [5] to answer the following questions:


RQ1: What percentage of the witnesses can DARTAGNAN validate? 
RQ2: What percentage can DARTAGNAN not validate and why? 
RQ3: Can DARTAGNAN validate witnesses that CPACHECKER cannot?
RQ4: Can CPACHECKER validate witnesses that DARTAGNAN cannot?


From the 20 verifiers in ConcurrencySafety, we selected five tools imple- 
menting different verification approaches. We consider them good representa- 
tives of the whole category: (i) CBMC [13] (used as a backend by DEAGLE [9] 
and LAZY-CSEQ [11]), (ii) CPAchecker [7] (used as a backend by CPA- 
LOCKATOR [3] and GRAVES [14]), (iii) EBF [2] (combines BMC with fuzzing, a 
very effective technique to find bugs), (iv) Dartagnan [17] (the only tool where the
memory model, here sequential consistency, is taken as an input), and (v) GemCutter
[12] (shares the codebase with UTAIPAN [8] and UAUTOMIZER [10]).

Table 1 presents the results of the validation in SV-COMP 2022. We report
the number of witnesses generated by each verifier (“WITNESSES”). For each
of the validators (columns “DARTAGNAN” and “CPACHECKER”), we report the
number of cases in which the validation finished conclusively (i.e., it returned TRUE
or FALSE), split into cases where the violation was confirmed (left of “/”) and cases
where it was not (right of “/”). We also report the number of correct validations by
one tool where the other did not report a result (columns “DART \ CPA” and
“CPA \ DART”, respectively).


TOOL        WITNESSES  DARTAGNAN  CPACHECKER  DART \ CPA  CPA \ DART
CBMC        305        193/0      95/0        117         19
CPACHECKER  256        0/0        256/0       0           256
DARTAGNAN   273        245/1      35/6        204         0
EBF         290        219/0      57/0        177         15
GEMCUTTER   299        18/237     262/1       15          28


Table 1. Results of the validation in SV-COMP 2022. 


For the SMT-based verifiers CBMC and EBF, DARTAGNAN has a validation success
rate of 63.28% and 75.52%, respectively (against 31.15% and 19.66% for
CPACHECKER). Unfortunately, it did not validate any of the witnesses
generated by CPACHECKER. This was due to a bug in the witness parser that
was identified and fixed after the competition. CPACHECKER validated all
the witnesses that it generated as a verifier. DARTAGNAN validated 89.74% of its
own witnesses, while CPACHECKER validated only 12.82% of them. For GEMCUTTER,
the validation success of DARTAGNAN is only 6.02%. This is because, due to another
bug, it wrongly marked 237 witnesses as not validated. The fixed version of
DARTAGNAN is able to validate all such cases. Even so, of the 18 witnesses
that DARTAGNAN validated, 15 were not validated by CPACHECKER,
thus improving the validation possibilities for the overall competition.
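
The success rates quoted above follow directly from Table 1, as the short calculation below shows. The numbers are copied from the table; the dictionary layout and the function name `success_rate` are choices made for this sketch only.

```python
# verifier: (witnesses, confirmed by Dartagnan, confirmed by CPAchecker),
# i.e., the left-hand numbers of the "confirmed/not confirmed" pairs in Table 1.
TABLE_1 = {
    "CBMC": (305, 193, 95),
    "CPAchecker": (256, 0, 256),
    "Dartagnan": (273, 245, 35),
    "EBF": (290, 219, 57),
    "GemCutter": (299, 18, 262),
}

def success_rate(confirmed, witnesses):
    """Percentage of witnesses whose violation was confirmed."""
    return round(100.0 * confirmed / witnesses, 2)

rates = {
    tool: (success_rate(dart, total), success_rate(cpa, total))
    for tool, (total, dart, cpa) in TABLE_1.items()
}

for tool, (dart_rate, cpa_rate) in rates.items():
    print(f"{tool:<11} Dartagnan {dart_rate:6.2f}%  CPAchecker {cpa_rate:6.2f}%")
```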


5 Software Project and Configuration 


The project home page is https://github.com/hernanponcedeleon/Dat3M. To 
run DARTAGNAN as a validator, use the following command: 


$ Dartagnan-SVCOMP.sh -witness <witness> <property> <program> 
Data Availability Statement. All data of SV-COMP 2022 are archived as described
in the competition report [5] and available on the competition web site. This includes
the verification tasks, results, witnesses, scripts, and instructions for reproduction.
The version of our verifier as used in the competition is archived together with other
participating tools [6].
Abstract. Deagle is an SMT-based multi-threaded program verification
tool. It is built on top of CBMC (front-end) and MiniSAT (back-end). The 
basic idea of Deagle is to integrate into the SMT solver an ordering con- 
sistency theory that handles ordering relations over the shared variable 
accesses in the program. The front-end encodes the input program into 
an extended propositional formula that contains ordering constraints. 
The back-end is reinforced with a solver for the ordering consistency
theory. This paper presents the basic idea, architecture, installation, and 
usage of Deagle. 


Keywords: Program verification · Satisfiability modulo theories · Concurrency.


1 Verification Approach 


Given a multi-threaded program, the thread communication behaviors can be 
modeled using the happens-before relations over memory access (read/write) 
events [1]. There are various kinds of happens-before relations: program order 
(PO), read-from order (RF), write serialization order (WS), and from-read order
(FR). A happens-before ordering formula (abbreviated as ordering formula) is 
a logical formula that involves only memory access events and happens-before 
relations. 

Deagle is an SMT-based multi-threaded program verifier, which consists of 


— a front-end that encodes the intra-threaded behaviors (e.g., the control and 
data flow per thread) into propositional formulas, and the inter-threaded 
behaviors (i.e., the communication between threads) into ordering formulas; 

— a back-end that extends MiniSAT with an ordering consistency theory solver 
[8] by following the DPLL(T) framework [7], and is able to solve propositional 
formulas and ordering formulas mixed together. 


* This work was supported in part by the National Key Research and Development 
Program of China (No. 2018YFB1308601) and the National Natural Science Foun- 
dation of China (No. 62072267 and No. 62021002). 


Compared with [8]: The theory solver in [8] uses a from-read axiom to derive
FR orders. Besides the from-read axiom, Deagle also implements a
write-serialization axiom [11], with which WS orders can also be derived. In return,
the front-end of Deagle need not encode both FR and WS orders explicitly.


2 Software Architecture 


Deagle is developed on top of CBMC [9] and MiniSAT [6] using C++. Addition- 
ally, for ease of usage and debugging, Deagle reuses some modules developed 
in Yogar-CBMC [10,11]. Deagle is not a strategy selection-based verifier. Deagle 
runs the following procedures successively to verify a given C program: 


Preprocessing (from Yogar-CBMC) For each global structure variable in the 
C program, the preprocessing procedure unfolds it by creating a fresh variable 
for each member. Note that arrays need no preprocessing; CBMC is able to handle 
each array as an entity. 


Parsing and Goto-Program Generation (originally in CBMC) CBMC em- 
ploys Flex and Bison to transform the preprocessed C program into an abstract 
syntax tree (AST). Then CBMC builds a goto program, where all branching state- 
ments and loop statements are represented with (conditional) goto statements. 


Library Function Modeling (extended from CBMC) CBMC models each 
multithreading-related library function (e.g., pthread_cond_wait). For example, 
mutex m contains a Boolean variable m_locked indicating whether m is locked; 
pthread_mutex_lock(&m) assumes m_locked to be originally false and sets
m_locked to true. Based on CBMC, we extend Deagle to support the modeling of 
more library functions. 
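
The lock model described above can be mimicked in a few lines. This is only a sketch of the idea in Python, not the modeling code of CBMC or Deagle: `assume`, `Mutex`, `model_mutex_lock`, and `model_mutex_unlock` are hypothetical names, and a failed assumption is represented by an exception that stands for an execution the model checker would discard as infeasible.

```python
class InfeasibleExecution(Exception):
    """Raised when an assumption fails; such executions are discarded."""

def assume(condition):
    if not condition:
        raise InfeasibleExecution()

class Mutex:
    def __init__(self):
        self.m_locked = False   # Boolean flag described in the text

def model_mutex_lock(m):
    assume(not m.m_locked)      # the lock must be free before it is taken
    m.m_locked = True

def model_mutex_unlock(m):
    # Assumed counterpart for illustration; not described in the source.
    m.m_locked = False

m = Mutex()
model_mutex_lock(m)
try:
    model_mutex_lock(m)         # second lock without unlock
except InfeasibleExecution:
    print("execution pruned: mutex already locked")
```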


Unwinding We employ bounded model checking (BMC) [3,4,5] to handle loops. 
If the program contains loops, we determine an unwinding limit and unwind the 
program to a loop-free bounded program: 


— If the maximal number of loop iterations can be determined through static
analysis, e.g.,
for (i = 0; i < 10; i++)
we set the unwinding limit to this maximal number of iterations;
— If the maximal number of loop iterations depends on non-determinism, e.g.,


for (i = 0; i < n; i++)


where n is obtained from the function __VERIFIER_nondet_int, we report
UNKNOWN since such loops cannot be fully unwound;
— Otherwise, we set the unwinding limit to 2.
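
The three cases above amount to a small decision procedure, sketched below. The classification of a loop into one of the three kinds is assumed to be given (in Deagle it would come from static analysis); the enum `LoopBound` and the function `unwinding_decision` are hypothetical names used only for this illustration.

```python
from enum import Enum

class LoopBound(Enum):
    STATIC = "static"   # bound determined through static analysis
    NONDET = "nondet"   # bound depends on a nondeterministic value
    OTHER = "other"     # any other loop

def unwinding_decision(kind, static_bound=None):
    """Return ("UNWIND", limit) or ("UNKNOWN", None) for one loop."""
    if kind is LoopBound.STATIC:
        return ("UNWIND", static_bound)   # unwind exactly up to the known bound
    if kind is LoopBound.NONDET:
        return ("UNKNOWN", None)          # cannot be fully unwound
    return ("UNWIND", 2)                  # default unwinding limit

print(unwinding_decision(LoopBound.STATIC, 10))
print(unwinding_decision(LoopBound.NONDET))
print(unwinding_decision(LoopBound.OTHER))
```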
Formula Generation (extended from CBMC) After unwinding, the loop-free
program is represented in the static single assignment (SSA) form, where each 
thread is a chain of assignments. These assignments can be directly modeled 
into first-order logic formulas (for ease of solving, we further convert them into 
propositional logic formulas). Additionally, an assignment may contain global 
memory access events; we model program orders and read-from orders (please 
refer to [8] for more information) of these events into the formulas. 
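
As an illustration of the SSA step, the following sketch renames variables in a straight-line sequence of assignments so that each variable version is assigned exactly once. It is a simplified stand-in, not CBMC's implementation: the tuple representation of assignments and the name `to_ssa` are assumptions made for this example, and memory access events are not modeled.

```python
def to_ssa(assignments):
    """Rename a straight-line program into SSA form.

    assignments: list of (target, operands) pairs, where operands is a tuple
    of the variable names read by the right-hand side.
    Returns the same list with versioned names; version 0 denotes the initial
    value of a variable.
    """
    version = {}
    ssa = []
    for target, operands in assignments:
        # Reads refer to the latest version written so far.
        used = tuple(f"{v}{version.get(v, 0)}" for v in operands)
        # Every write creates a fresh version of the target.
        version[target] = version.get(target, 0) + 1
        ssa.append((f"{target}{version[target]}", used))
    return ssa

# x = y; x = x + y; y = x
print(to_ssa([("x", ("y",)), ("x", ("x", "y")), ("y", ("x",))]))
```

Because every versioned name is written once, each assignment can be read as an equation, which is what makes the direct translation into a logical formula possible.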


Constraint Solving (extended from MiniSAT) We develop an ordering con- 
sistency theory solver and integrate it into the DPLL(T) framework [8]. For 
efficiency, we extend MiniSAT, a SAT solver, to run our theory solver
exclusively. Please refer to [8] for the detailed algorithms of our decision procedure.


Witness Generation (adapted from Yogar-CBMC) If the back-end solver 
returns satisfiable (i.e., finds a counterexample violating the property), our or- 
dering consistency theory solver reports a sequence (total order) of these events, 
which can be used for generating the witness of the counterexample. 
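
To illustrate how a total order of events can be obtained from ordering constraints, the sketch below uses a plain topological sort from the Python standard library. This is a swapped-in textbook technique, not Deagle's decision procedure (which is described in [8]). It assumes that a set of happens-before constraints is consistent exactly when the induced graph is acyclic; `total_order` is a hypothetical name.

```python
from graphlib import TopologicalSorter, CycleError

def total_order(events, hb_edges):
    """Return a total order consistent with hb_edges, or None if cyclic.

    events: iterable of event names.
    hb_edges: iterable of (before, after) happens-before pairs.
    """
    predecessors = {event: set() for event in events}
    for before, after in hb_edges:
        predecessors[after].add(before)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None   # the ordering constraints are inconsistent

print(total_order(["w1", "r1", "w2"], [("w1", "r1"), ("r1", "w2")]))
print(total_order(["a", "b"], [("a", "b"), ("b", "a")]))
```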


3 Strengths and Weaknesses 


Compared to the traditional method [1] which explicitly converts ordering for- 
mulas into propositional formulas, Deagle employs a dedicated theory solver to 
handle ordering formulas, which improves both time and space efficiency. Ignor- 
ing some tasks in goblint-regression that require unwinding 10000 times, Deagle 
reports TIMEOUT in only 9 tasks and OUT OF MEMORY in only 7 tasks — 
fewer than most ConcurrencySafety competitors. 

In most weaver tasks (117 out of 169), the number of loop iterations is
non-deterministic. As mentioned in the previous section, Deagle reports UNKNOWN
for such tasks. Since such tasks are common in real-world programs, we are exploring
an approach to dealing with such programs in future work.


4 Tool Setup and Configuration 


The source code of Deagle 1.3 (the submitted version in SV-COMP 2022 [2]) is
publicly accessible⁴. Please refer to the README for installation instructions.
In SV-COMP 2022, Deagle participates in the ConcurrencySafety category and only
checks the property Unreach-Call⁵. By setting the parameters


--32 --no-unwinding-assertions --closure


one can reproduce Deagle’s results of SV-COMP 2022.


4 Deagle repository: https://github.com/thufv/Deagle
5 The benchmark definition of Deagle: https://gitlab.com/sosy-lab/sv-comp/bench-defs/-/blob/main/benchmark-defs/deagle.xml
4.1 Parameter Definition


Deagle inherits lots of parameters from CBMC. Due to the page limit, we only 
describe parameters related to the competition or newly added in Deagle: 


— --32/--64: sets the width of integers to 32/64.

— --no-unwinding-assertions: does not generate unwinding assertions into
the formula. Assuming a loop is unwound n times, its unwinding assertion
asserts the loop condition to be false after n iterations. Since unwinding
assertions can lead to false counterexamples, we disable the generation of
unwinding assertions.

— --closure/--icd (new in Deagle): uses our proposed approach. Once
the parameter --closure is enabled, Deagle employs a transitive closure-
based theory solver (recommended). If --icd is enabled, Deagle employs
an incremental cycle detection-based solver. In SV-COMP 2022 [2], Deagle
solves all tasks with the parameter --closure.
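
The meaning of an unwinding assertion can be made concrete with a small simulation. The sketch below is illustrative only and unrelated to the code of CBMC or Deagle: `run_unwound` is a hypothetical helper that executes at most n iterations of a loop and then evaluates the unwinding assertion, i.e., whether the loop condition is false after the unwinding.

```python
def run_unwound(state, cond, body, n):
    """Execute at most n iterations of `while cond(state): state = body(state)`.

    Returns (final_state, assertion_holds), where assertion_holds is the value
    of the unwinding assertion: the loop condition is false after unwinding.
    """
    for _ in range(n):
        if not cond(state):
            break
        state = body(state)
    return state, not cond(state)

# for (i = 0; i < 10; i++) unwound 10 times: the assertion holds.
print(run_unwound(0, lambda i: i < 10, lambda i: i + 1, 10))
# The same loop unwound only twice: the assertion fails, although the
# original program contains no error at this point.
print(run_unwound(0, lambda i: i < 10, lambda i: i + 1, 2))
```

The second call shows why an insufficient unwinding limit combined with unwinding assertions can produce a counterexample that does not correspond to a real property violation.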


5 Software Project 


Deagle is developed by Fei He, Zhihang Sun, and Hongyu Fan from the Formal 
Verification Lab at Tsinghua University. Deagle is licensed under GPLv3. Since
Deagle is developed over CBMC and MiniSAT, and reuses some modules from 
Yogar-CBMC, it also contains copyright of those tools. 
Abstract. Frama-C is a well-known platform for source-code analysis of
programs written in C. It can be extended via its plug-in architecture by 
various analysis backends and features an extensive annotation language 
called ACSL. So far it was hard to compare Frama-C to other software 
verifiers. Our competition participation contributes an adapter named 
Frama-C-SV, which makes it possible to evaluate Frama-C against other 
software verifiers. The adapter transforms standard verification tasks 
(from the well-known SV-Benchmarks collection) in a way that can be 
understood by Frama-C and produces a verification witness as output. 
While Frama-C provides many different analyses, we focus on the Evolved 
Value Analysis (EVA), which uses a combination of different domains to 
over-approximate the behavior of the analyzed program. 


Keywords: Software verification · Program analysis · Formal methods · Competition
on Software Verification · Comparative Evaluation · SV-COMP · Frama-C


1 Approach 


This competition contribution is based on Frama-C [12], a program-analysis 
platform for C programs. The purpose of the participation in the comparative 
evaluation SV-COMP is to show the strengths of FRAMA-C when applied to 
the problem of verifying C programs from the SV-Benchmarks [4] collection of 
verification tasks. 


2 Architecture 


Although FRAMA-C has a large configuration space, it does not support standard
specifications as used in SV-COMP, and it does not produce verification witnesses
by default. To overcome this obstacle, we implemented an adapter for
FRAMA-C using input and output transformers. The architecture of the adapter
is illustrated in Fig. 1. In the following, we describe the artifacts and actors of
the participating verifier: in Sect. 2.1 we describe all the components that are
developed as part of the adapter, while in Sect. 2.2 we describe in more detail
how the EVA analysis of FRAMA-C that we use works.


[Figure 1: architecture diagram of FRAMA-C-SV. Inputs: Program, Specification. Components: Input Transformer, Configuration Options, Harness, Program’, Frama-C, Output Transformer. Outputs: verdict TRUE with Witness, UNKNOWN, or FALSE with Witness.]


Fig. 1: Architecture of FRAMA-C-SV: the inputs and outputs of FRAMA-C are 
translated to interface with the established standards as used by SV-COMP; the 
components that are necessary to adapt FRAMA-C for comparison with other 
verifiers amount to 678 lines of code mostly written in Python 


2.1 FRAMA-C-SV 


Input Transformer. The input transformer takes the program p and speci- 
fication s and creates a new program p’ in which the specification s has been 
expressed as FRAMA-C-specific annotations. FRAMA-C uses ACSL [1] as the language
to specify annotations. The input transformer also selects configuration
parameters for FRAMA-C that are best suited for the verification task. Currently we
encode reachability tasks into signed integer overflows by adding an artificial
overflow to the body of the function reach_error. This works well in practice
and is also sound, since if there were any other overflows, the task would contain 
undefined behavior and would not be a valid reachability task in the first place. 
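
The encoding of reachability into an overflow check can be pictured as a source-to-source rewrite. The following sketch is purely hypothetical and does not reproduce the code of FRAMA-C-SV: the regular expression, the replacement body, and the name `instrument` are assumptions made for illustration, and the sketch only handles a definition of reach_error whose body contains no nested braces.

```python
import re

# Hypothetical replacement body: incrementing INT_MAX (on a platform with
# 32-bit int) is a signed integer overflow.
OVERFLOW_BODY = "{ int x = 2147483647; x = x + 1; }"

# Matches a definition such as: void reach_error() { ... }
REACH_ERROR_DEF = re.compile(
    r"void\s+reach_error\s*\(\s*(?:void)?\s*\)\s*\{[^{}]*\}"
)

def instrument(program_text):
    """Replace the body of reach_error by an artificial signed overflow."""
    return REACH_ERROR_DEF.sub("void reach_error(void) " + OVERFLOW_BODY,
                               program_text)

source = (
    'void reach_error() { __assert_fail("0", "a.c", 3, "reach_error"); }\n'
    "int main() { reach_error(); return 0; }\n"
)
print(instrument(source))
```

After the rewrite, every call to reach_error triggers the overflow, so an overflow analysis that proves the absence of overflows also proves that reach_error is unreachable.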


Configuration Options. Depending on the input program and specification, we 
can choose different options that are passed to FRAMA-C. In essence, this acts like 
an algorithm selection [14] and, e.g., allows us to choose a different configuration 
of FRAMA-C depending on the specified property. 


Harness. Some programs in the SV-Benchmarks collection use specific func- 
tions to model non-determinism. We provide implementations for those functions 
(__VERIFIER_*) in a separate C program such that the semantics of those
functions can be understood by FRAMA-C. This separate C program is passed to
FRAMA-C together with the transformed program p'. 


Output Transformer. The output of FRAMA-C needs to be interpreted regard- 
ing the original specification, and depending on the outcome, a verification witness 
needs to be generated. Thus, we need an output transformer for (a) providing a 
verdict for the verification task and (b) providing a verification witness. Regard- 
ing (a), the output transformer interprets the CSV report that can be generated 
by Frama-C to determine whether the program was proven to be safe (verdict 
TRUE), whether a specification violation occurred (verdict FALSE), or whether 
no such statement can be made (verdict UNKNOWN). We also generate a minimal 
correctness or violation witness for the verdicts TRUE and FALSE, respectively. 
The witness automata consist of only one node, which for violation witnesses is
marked as violation node. In the future we plan to augment these witnesses with 
information such as invariants that have been found by FRAMA-C. 


2.2 FRAMA-C 


One of the strengths of Frama-C is its modular architecture [10], which allows 
a configuration of the best possible analysis backends for a certain verification 
problem. We choose the plug-in EVA [9], which is well suited for an automatic 
analysis. Other plug-ins such as the Weakest-Preconditions (WP) plug-in require 
hints from the user in order to be effective. In the following we will briefly describe 
the most important aspects of the EVA analysis configuration that we use. For a 
more detailed description, we refer the reader to the relevant literature [7, 8, 9]. 

FRAMA-C provides a meta-option called -eva-precision for the EVA plug-in 
with possible values ranging from 0 to 11. With higher values for this option more 
precise domains and thresholds are used, at the cost of increased computation 
time. We currently use the maximum value of 11 in order to make the best use 
of the 900s CPU time limit. In the future we might want to iteratively increase 
this value starting at lower precisions. 
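
The iterative scheme mentioned as future work could look roughly as follows. This is a speculative sketch, not part of FRAMA-C-SV: `analyze` is a hypothetical callable that runs the analysis at a given precision and returns a verdict string, the budget is measured in wall-clock time instead of CPU time for simplicity, and a run that is already in progress is not interrupted.

```python
import time

def iterative_precision(analyze, budget_seconds, max_precision=11,
                        clock=time.monotonic):
    """Try increasing precision levels until a conclusive verdict is found.

    analyze: hypothetical callable mapping a precision level to one of
             "TRUE", "FALSE", or "UNKNOWN".
    Returns (precision, verdict); precision is None if nothing conclusive
    was found within the budget.
    """
    deadline = clock() + budget_seconds
    for precision in range(max_precision + 1):
        if clock() >= deadline:
            break                      # budget exhausted
        verdict = analyze(precision)
        if verdict in ("TRUE", "FALSE"):
            return precision, verdict  # cheapest conclusive precision
    return None, "UNKNOWN"

# Stub analysis that only succeeds from precision 3 upwards.
print(iterative_precision(lambda p: "TRUE" if p >= 3 else "UNKNOWN", 900))
```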


Domains. The EVA analysis always uses the domain cvalue, which tracks values 
of variables either as constant values, sets, or intervals of possible values (including 
modular congruence constraints). For pointer addresses, these are either tracked 
as addresses with offsets or as so-called garbled mix, which overapproximates the 
set of possible memory locations. In addition, depending on the precision level, 
various other domains are used that we describe in the following. The domain 
symbolic-locations tracks a map of symbolic locations to values, which is, e.g., 
helpful for analyzing expressions containing array accesses such as a[i] < a[j].
The equality domain tracks equalities of C expressions found in the code, whereas
the gauges domain tracks relations between variables in a loop with the goal 
to discover linear inequality invariants [16]. Lastly the octagon domain tracks 
certain linear constraints between pairs of variables [13]. As we use the highest 
precision level, all of these domains are used in our contribution. 


Precision of the State-Space Exploration. Apart from the domains, the 
precision of state-space exploration in FRAMA-C is affected by various options. We 
will describe some of these in the following; a complete list of affected settings and 
values is always printed by FRAMA-C when the option eva-precision is specified 
by the user. Option slevel (set to 5000) determines how many separate states 
are kept before new states will be joined into existing ones. Option ilevel (set 
to 256) determines how many different values are tracked per variable before 
overapproximating the value range. Option plevel (set to 2000) affects the size 
up to which arrays are tracked. The option auto-loop-unroll (set to 1024) will 
determine up to which bound a loop is considered for unrolling. 
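
The effect of a threshold such as ilevel can be illustrated with a toy value abstraction. The sketch below is not FRAMA-C's cvalue domain (which, for example, also tracks congruence information); it only shows the general idea of keeping a precise set of values up to a limit and falling back to an enclosing interval beyond it. The function name `abstract` and the tuple encoding are assumptions made for this example.

```python
def abstract(values, ilevel):
    """Abstract a collection of integers for one variable.

    Up to `ilevel` distinct values are tracked precisely as a set; beyond
    that, the values are over-approximated by the enclosing interval.
    """
    values = frozenset(values)
    if len(values) <= ilevel:
        return ("set", values)
    return ("interval", (min(values), max(values)))

def may_contain(abstraction, value):
    """Check whether the abstraction admits `value`."""
    kind, data = abstraction
    if kind == "set":
        return value in data
    low, high = data
    return low <= value <= high

precise = abstract({1, 5, 9}, ilevel=4)
coarse = abstract(range(0, 10, 2), ilevel=3)
print(precise, may_contain(precise, 3))
print(coarse, may_contain(coarse, 3))
```

The interval admits the value 3 although the variable never takes it; a higher threshold avoids this imprecision at the cost of tracking more values.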
3 Strengths and Weaknesses


The competition contribution shows the strengths of FRAMA-C in checking C
programs for overflows and also (in the currently supported sub-categories¹) for
reachability. Here we are able to show that our results are comparable to, and often
surpass, those of other tools based on abstract interpretation [11] such as
GOBLINT [15]. While the EVA analysis of FRAMA-C that we use is based on abstract
interpretation, the precision options described in Sect. 2.2 allow for a more precise 
state-space exploration, which behaves more like model checking. More details 
about the results can be found in the competition report [2] and artifact [3]. 

The approach that we describe in this paper creates a compatibility
layer between the capabilities of FRAMA-C and the standards used in the
SV-Benchmarks collection. While still a work in progress, we have shown that
it is possible to bridge this gap while preserving overall soundness. It is also
interesting to consider the results on verification tasks from the SV-Benchmarks
collection for a tool that did not participate before.

Although our approach is sound in general, we are likely not showcasing the full 
potential of FRAMA-C. One aspect to consider here is the large configuration space, 
which means there might be ways to verify more tasks with a better heuristic 
for selecting the configuration options. The other aspect is that FRAMA-C also 
provides different plug-ins such as the WP plug-in, which requires more (manual) 
annotations, but can also potentially solve more tasks than the more automatic 
EVA plug-in. 


4 Software Project and Contributors 


The software project FRAMA-C is developed at https://git.frama-c.com/pub/frama-c/
and our adapter FRAMA-C-SV is developed at
https://gitlab.com/sosy-lab/software/frama-c-sv, both being released under open-source
licenses. The exact version of the adapter that participated in SV-COMP 2022
is also archived in the competition’s tool-archive repository² [6]. FRAMA-C was
funded by the European Commission in program Horizon 2020. The adapter
FRAMA-C-SV was funded by the DFG. We thank the FRAMA-C authors³ for their
contribution to the software-verification community.


Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [2] and available on the competition web site. This includes 
the verification tasks [4], competition results [3], verification witnesses [5], scripts, and 
instructions for reproduction. The version of Frama-C-SV as used in the competition is 
archived together with other participating tools [6]. 


Funding Statement. This work was funded in part by the Deutsche Forschungsge- 
meinschaft (DFG) — 378803395 (ConVeY). 


1 We opted out of subcategories with unsound results caused by Frama-C making 
assumptions that are different from the conventions of SV-COMP. 

2 https://gitlab.com/sosy-lab/sv-comp/archives-2022/blob/svcomp22/2022/frama-c-sv.zip 

3 https://frama-c.com/html/authors.html
Abstract. GDART is an ensemble of tools allowing dynamic symbolic
execution of JVM programs. The dynamic symbolic execution engine is 
decomposed into three different components: a symbolic decision engine 
(DSE), a concolic executor (SPouT), and an SMT solver backend allowing
meta-strategy solving of SMT problems (JConstraints). The symbolic
decision component is loosely coupled with the executor by a newly in- 
troduced communication protocol. At SV-COMP 2022, GDART solved 
471 of 586 tasks finding more correct false results (302) than correct true 
results (169). It scored fourth place. 


Keywords: Dynamic Symbolic Execution · Software Verification


1 Verification Approach 


This paper presents the GDART ensemble tool, a dynamic symbolic execution
engine for the JVM. Dynamic symbolic execution is a well-established technique
for software testing (cf. DART [6]), and two contestants in SV-COMP 2021
already used this technique (cf. JDART [7,9] and COASTAL³).
It is a search algorithm for the systematic exploration of a program's state space
for a property violation, which stops after exhausting the resource limits,
exploring the complete symbolic state space, or encountering an error. The end
of the search is fully configurable in GDART.

In SV-COMP 2022 [3], a dynamic symbolic execution tool, JDART (714 points),
wins the JAVA track for the first time, beating JBMC (700 points) [4],
a bounded model checker for JAVA, and JAVA RANGER (670 points) [11], a
symbolic execution engine for JAVA extended by veritesting [1]. JDART’s result
underlines the potential of dynamic symbolic execution for the verification of 
JAVA programs in general. The concrete implementation of JDART is closely 
coupled to the Java PathFinder VM (JPF-VM) [12] running the complete anal- 
ysis within one virtual machine. The advantage of the JPF-VM is that it runs 


* This work has been partially funded by an Amazon Research Award
3 https://github.com/DeepseaPlatform/coastal
as a guest JVM on top of a host JVM. The analysis might mock parts of the
guest JVM and use the host JVM for running side computations required to
compute results used in the mock. The downside of the JPF-VM is its research
tool status and that it is costly to maintain given JAVA's fast pace in releasing
new features.

COASTAL demonstrated for the first time what a loosely coupled architec- 
ture between the symbolic exploration engine and a concolic execution engine 
might look like. It instruments the bytecode with ASM⁴, a Java bytecode
manipulation framework, to obtain symbolic traces. This makes the analysis
independent of the JPF-VM. The downside is that bytecode manipulation offers less
flexibility than hooking directly into the JVM. 


2 Software Architecture 


[Figure 1: diagram with three components, Constraint Solving (CVC4, Z3, ...), Symbolic Exploration (DSE/JCONSTRAINTS), and Concolic Execution (SPouT), connected by edges labeled SMT Problem, Model, Concrete Values, and Symbolic Trace.]


Fig. 1: GDart's ensemble architecture and the interplay between the components. 


GDART takes the strengths of JDART's mocking flexibility and combines it 
with COASTAL’s modular design. Figure 1 demonstrates the architecture of the 
GDART ensemble tool. The main analysis component is the symbolic explorer. 
It orchestrates the concolic executor and requests solutions for SMT problems 
from the constraint solvers powering the symbolic exploration. 


Symbolic Exploration. We name the symbolic explorer the “DSE” component, as it
performs the symbolic exploration and starts the concolic executor, the two
main tasks of a dynamic symbolic execution engine. It manages the constraint
tree and guides its exploration. To explore a path, it computes a set of
concrete values that drives the concolic executor down the path of interest and
seeds the executor with these values. After the executor terminates, it parses
the obtained symbolic trace and integrates it into the constraint tree. Next, it
constructs from the constraint tree an SMT problem that describes the next path
to explore and starts a constraint solver to obtain either a model suitable to
drive the execution down this path or an unsatisfiable verdict implying that
the path is unreachable. The


4 https://asm.ow2.io
search behavior of GDART is configured in the DSE. Once the search terminates,
DSE generates a verification witness from the constraint tree. 
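
The exploration loop described above can be sketched as follows. This is a minimal, self-contained Java illustration and not GDART's implementation: the class ToyDse, the three branch conditions of the toy program, and the brute-force solve method that stands in for an SMT solver are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.function.IntPredicate;

final class ToyDse {
    // Branch conditions of a hypothetical program under test, in evaluation order.
    static final List<IntPredicate> BRANCHES =
            List.of(x -> x > 10, x -> x % 2 == 0, x -> x == 42);

    // Stand-in for the concolic executor: run with a concrete input and
    // return the branch decisions taken (the "symbolic trace" of this toy).
    static List<Boolean> execute(int x) {
        List<Boolean> trace = new ArrayList<>();
        for (IntPredicate branch : BRANCHES) {
            boolean taken = branch.test(x);
            trace.add(taken);
            if (!taken) break; // the toy program returns once a check fails
        }
        return trace;
    }

    // Stand-in for the constraint solver: brute-force search for an input
    // whose trace starts with the requested decisions (empty means "unsat").
    static Optional<Integer> solve(List<Boolean> prefix) {
        for (int x = -100; x <= 100; x++) {
            List<Boolean> t = execute(x);
            if (t.size() >= prefix.size() && t.subList(0, prefix.size()).equals(prefix)) {
                return Optional.of(x);
            }
        }
        return Optional.empty();
    }

    // Exploration loop: execute, record the path, then request inputs for
    // every not yet requested sibling branch along the recorded path.
    static Set<List<Boolean>> explore(int seed) {
        Set<List<Boolean>> paths = new HashSet<>();
        Set<List<Boolean>> requested = new HashSet<>();
        Deque<Integer> inputs = new ArrayDeque<>();
        inputs.add(seed);
        while (!inputs.isEmpty()) {
            List<Boolean> trace = execute(inputs.poll());
            if (!paths.add(trace)) continue; // path already known
            for (int i = 0; i < trace.size(); i++) {
                List<Boolean> flipped = new ArrayList<>(trace.subList(0, i));
                flipped.add(!trace.get(i));
                if (requested.add(flipped)) solve(flipped).ifPresent(inputs::add);
            }
        }
        return paths;
    }
}
```

Starting from the seed input 0, ToyDse.explore(0) visits the four feasible paths of the toy program, including the one that is only reached for the input 42. The FIFO queue of inputs gives a breadth-first flavor; other search strategies would replace the queue.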


Concolic Executor. One of the core contributions of GDART is the new concolic
executor SPOUT, implemented as part of the Espresso guest language running on
top of the GRAALVM [13]⁵. The GRAALVM is an industrial-grade JVM maintained
by Oracle. It offers most of the architectural benefits of the JPF-VM apart
from state tracking, which concolic execution does not require. SPOUT can be
seeded with concrete values to drive the execution along a concrete path. In
addition, it can introduce new symbolic variables for previously unknown
inputs. During execution, it records manipulations of and constraint checks on
symbolic variables and, when the path terminates, reports a symbolic execution
trace together with the concrete execution result. Decisions on the symbolic
variables are encoded in the SMT-Lib format. As SPOUT maintains the two VM
layers, behavior in the Espresso VM running the analysis can be mocked by a
substitute that is executed on the host GRAALVM during concolic execution, the
same way JDART mocks the environment if needed. This feature is also used for
intercepting invocations of the string library in JAVA and encoding them
symbolically.


Constraint Solving. The third component is constraint solving. DSE uses the 
JCONSTRAINTS library to model SMT-Lib constraints internally and interact 
with the solver. GDART is backed by CVC4 [2] and Z3 [5]. We combine these 
two SMT solvers in a portfolio approach according to the CVCSEQEVAL strategy 
presented in our previous work [8]. 


3 Strengths and Weaknesses 


GDART took fourth place with 640 points, behind JDART (714 points), JBMC
(700 points), and JAVA RANGER (670 points). Dynamic symbolic execution tools
tend to be stronger at finding property violations than at confirming their
absence on the SV-COMP benchmark. This is partially by design, as
some of the problems (e.g., those in the jayhorn-recursive subgroup)
test the handling of tremendously large and hard-to-explore state
spaces. GDART disproves the property in 302 cases and confirms it in 169 cases.
In total, GDART answered 471 of 586 tasks correctly and none incorrectly. These
are 40 more correct false results than JAVA RANGER found (262 correct
false tasks out of 466 solved tasks). In total, GDART solved five more tasks than
JAVA RANGER and 35 fewer than JBMC.

In direct comparison with GDART, JDART solved 192 (+23) correct true
tasks and 330 (+28) correct false tasks. Three factors contribute to the gap
between GDART and JDART: the performance overhead of spinning up one JVM
per executor run (we do not have the exact number, but spinning up a JVM
costs at least 500 ms, which especially affects tasks with huge exploration
trees), the technical maturity of the implementation, as JDART has been around
longer, and a value tracing heuristic built into JDART, but not into GDART,
for tracking numerical values that originate from a serialized string
representation. The performance overhead of spinning up multiple JVMs is the
only drawback that stems from the modular design of GDART and will not go away
in the future. In the score-based quantile plots for CPU time, JDART's time per
task after achieving 600 points is close to five seconds, while GDART's time
per task reaches close to 50 seconds for the same score.

The weakness of dynamic symbolic execution is state space explosion, which
also affects GDART. Slowing down each executor run by spinning up new VMs
is a disadvantage given the resource constraints of SV-COMP. On the bright
side, with more relaxed resource limits, executor runs could proceed in
parallel to the symbolic exploration of the constraint tree. We leave this
parallel breadth-first search on multi-core machines as future work for the
DSE component. At the moment, all paths are explored sequentially.


4 Tool Setup 


GDART is run with various configuration options hard-coded into the SV-COMP 
run scripts. More precisely, we enabled witness generation, used the described 
solver strategy in the constraint backend, chose a breadth-first search on the 
constraint tree, and used the same bounded solving as JDART. The search is 
configured to terminate at the first assertion error it hits.


5 Software Project 


The components are currently all developed at TU Dortmund by the group led
by Falk Howar. DSE⁶ is available under the Apache 2.0 license, JCONSTRAINTS⁷
as well, and SPOUT⁸ is available under the GPL v2 license. We also provide the
run scripts for SV-COMP on GitHub⁹.


6 Data Availability Statement 


The GDART archive used for SV-COMP 2022 is available at Zenodo [10].
Abstract. GRAVES-CPA is a verification tool which uses algorithm
selection to decide an ordering of underlying verifiers to most effectively
verify a given program. GRAVES-CPA represents programs using an
amalgam of traditional program graph representations and uses
state-of-the-art graph neural network techniques to dynamically decide how
to run a set of verification techniques. The GRAVES technique is
implementation agnostic, but its competition submission, GRAVES-CPA, is
built using several CPAchecker configurations as its underlying verifiers.


Keywords: Software Verification · Graph Attention Networks · Graph
Neural Networks · Algorithm Selection


1 Verification Approach 


GRAVES-CPA is an algorithm selector for software verification based on graph 
neural network techniques. As the tool PeSCo [14] has shown, dynamic order- 
ing of verification techniques can result in faster and more accurate verification. 
Computing an ordering on techniques dynamically will incur some runtime, but 
an effective ordering will oftentimes make this overhead insignificant in compari- 
son to the time saved by using a more appropriate technique. Like most algorithm 
selectors, GRAVES-CPA uses machine learning to make its selections. However, 
it uses graph neural networks (GNNs) so it can represent programs using tra- 
ditional program abstractions, such as abstract syntax trees (ASTs). GRAVES- 
CPA uses a variant of GNNs called Graph Attention Networks (GATs) [16]. 
GATs use a learned attention mechanism which is trained to learn the impor- 
tance of edges in a given graph. 

GNNs are an emerging field in machine learning. Traditional neural networks
accept input vectors, which have a fixed size and a natural ordering on elements, 
but graphs, in general, have neither. GNNs avoid these issues by operating on 
individual nodes in the graph, instead of the graph as a whole [15]. Typically, the 
input to a GNN is the current representation of a node v and a collation of the 
representations of its neighboring nodes. The output is then a new representation 
for v. This process is repeated independently for all nodes in the graph. Thus, the 
number of nodes in the graph and the order in which they are processed are irrelevant.
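
The per-node update described above can be sketched in a few lines of Java. This is an illustration only: the class MessagePassing is hypothetical, and the plain element-wise mean used to combine a node with its neighbors is an assumption standing in for the learned, attention-weighted aggregation of a real GAT layer.

```java
final class MessagePassing {
    // One round of message passing. h[v] is the current representation of
    // node v, neighbors[v] lists the nodes adjacent to v. The new
    // representation of v is the mean of its own vector and the vectors of
    // its neighbors (assumption: no learned weights, no attention).
    static double[][] step(double[][] h, int[][] neighbors) {
        double[][] next = new double[h.length][];
        for (int v = 0; v < h.length; v++) {
            double[] sum = h[v].clone();
            for (int u : neighbors[v]) {
                for (int d = 0; d < sum.length; d++) sum[d] += h[u][d];
            }
            for (int d = 0; d < sum.length; d++) sum[d] /= (neighbors[v].length + 1);
            next[v] = sum; // computed from the old h only, so node order is irrelevant
        }
        return next;
    }
}
```

Because every new vector is computed from the previous round's representations only, the result does not depend on the number of nodes or on the order in which they are visited, which is the property the paragraph above relies on.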


© The Author(s) 2022 
The GRAVES technique is tool agnostic [11], meaning it can be trained to
select from any set of verifiers. Our competition contribution selects an ordering 
from the techniques utilized by CPAchecker [3], similar to PeSCo in previous 
competitions. 

To form its selection, GRAVES-CPA produces a graph representation of a 
given program, G, which is based on its AST with control flow, data flow, and 
function call and return edges added between the tree's nodes. The AST's nodes 
and edges ensure the semantics of the statements in the program are maintained. 
Control flow edges maintain the branching and order of execution between these 
statements. Data flow edges explicitly relate the definitions, uses, and
interactions of values in the program. G is passed to a GNN, consisting of a
series of GATs, which outputs a graph feature vector. This feature vector is
finally passed to a fully connected neural network which decides the sequence
in which GRAVES-CPA’s suite of verification techniques is run.


2 System Architecture 


2.1 Graph Generation 


To generate a graph from a program, GRAVES-CPA relies on the AST produced 
by the C compiler Clang [10]. Using a visitor pattern [9], GRAVES-CPA walks the 
AST to generate data flow edges and the edges of the program’s Interprocedural 
Control Flow Graph (ICFG). Function call and return edges in the ICFG are 
those which can be determined purely syntactically. Using the ICFG and data 
flow edges, GRAVES-CPA produces additional data flow edges using the work- 
list reaching definition algorithm [1]. We limit the number of iterations of the 
reaching definition algorithm, making our data edges an under-approximation of 
possible data flow edges. Once this graph is generated, it is parsed into a list of 
nodes and several edge sets. Nodes represent the AST token which corresponds 
to them using a one-hot encoding. These nodes and edges are used as input to 
the GNN. 
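
To illustrate why bounding the iterations yields an under-approximation, the following Java sketch implements a generic worklist reaching-definitions analysis with an iteration cap. It is not GRAVES-CPA's graph builder: the class BoundedReachingDefs, the encoding of the control flow graph as arrays, and the parameter maxIterations are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BoundedReachingDefs {
    // defVar[n]: variable defined at node n, or null if n defines nothing.
    // succ[n]: successors of node n in the control flow graph.
    // Returns, per node, the ids of the definitions found to reach its entry.
    static List<Set<Integer>> analyze(String[] defVar, int[][] succ, int maxIterations) {
        int n = defVar.length;
        List<Set<Integer>> in = new ArrayList<>();
        List<Set<Integer>> out = new ArrayList<>();
        Deque<Integer> worklist = new ArrayDeque<>();
        for (int i = 0; i < n; i++) {
            in.add(new HashSet<>());
            out.add(new HashSet<>());
            worklist.add(i);
        }
        int iterations = 0;
        while (!worklist.isEmpty() && iterations++ < maxIterations) {
            int node = worklist.poll();
            Set<Integer> newOut = new HashSet<>(in.get(node));
            if (defVar[node] != null) {
                newOut.removeIf(d -> defVar[node].equals(defVar[d])); // kill older definitions
                newOut.add(node);                                     // generate this definition
            }
            if (!newOut.equals(out.get(node))) {
                out.set(node, newOut);
                for (int s : succ[node]) {
                    in.get(s).addAll(newOut);
                    if (!worklist.contains(s)) worklist.add(s);
                }
            }
        }
        return in; // sets only grow, so stopping early can only miss definitions
    }
}
```

Since the sets start empty and only grow, cutting the loop off early can drop data flow facts but never invents any. A data flow edge would then connect each reaching definition to a node that uses the defined variable.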


2.2 Prediction 


To form a prediction, GRAVES-CPA uses a GNN, visualized in Figure 1, which 
consists of 2 GAT layers, a jumping knowledge layer [17], and an attention- 
based pooling layer [12]. The GAT layers are crucial to our technique. When 
propagating data through the graph, the attention mechanism in each layer
weights edges so information important to predictions is more prominent than
superfluous data.

The jumping knowledge layer concatenates intermediate graph representa- 
tions, denoted by A, B, and C, allowing the model to learn from each represen- 
tation. The attention-based pooling layer calculates an attention value for each 
node in the graph. All nodes are weighted by their respective attention values 
and then summed together to form a graph feature vector. The combination of 


Fig. 1. GRAVES uses a GNN composed of 2 GAT layers, a jumping knowledge layer,
and an attention pooling layer. These layers produce a graph feature vector which a
3-layer prediction network uses to order verifiers for sequential execution. An in-depth
description of this architecture can be found in Leeson et al. [11].


GAT layers and the attention-based pooling layer allows the network to weigh the
importance of both edges and nodes when forming the graph feature vector. This
feature vector is fed to a three-layer neural network which decides the sequence
of tool execution.
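
The attention-based pooling step can be sketched as a weighted sum of node vectors. The Java class below is hypothetical; in particular, the softmax normalization of the raw scores is an assumption, and in a trained model the scores would come from a learned function rather than being passed in.

```java
final class AttentionPooling {
    // nodes[i]: representation of node i; scores[i]: raw attention score of node i.
    // Returns the graph feature vector: the sum of all node vectors, each
    // weighted by its softmax-normalized attention score.
    static double[] pool(double[][] nodes, double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) max = Math.max(max, s);
        double[] weights = new double[scores.length];
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            weights[i] = Math.exp(scores[i] - max); // subtract max for numerical stability
            total += weights[i];
        }
        double[] feature = new double[nodes[0].length];
        for (int i = 0; i < nodes.length; i++) {
            for (int d = 0; d < feature.length; d++) {
                feature[d] += (weights[i] / total) * nodes[i][d];
            }
        }
        return feature;
    }
}
```

With equal scores the result is the plain mean of the node vectors; a node with a much higher score dominates the feature vector. The output has a fixed size regardless of how many nodes the graph has, which is what lets a fully connected network consume it.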

GRAVES-CPA was trained using data collected from running 5 configurations 
of the CPAchecker framework on the verification tasks from SV-COMP 2021. 
Labels for each configuration come from the SV-COMP score the configuration 
would receive for a given program minus a time penalty. Similar to CPAchecker's 
competition contribution, these configurations are symbolic execution [6], value 
analysis [7], value analysis with CEGAR [7], predicate analysis [5], and bounded 
model checking with k-induction [4]. To prevent GRAVES-CPA from overfitting 
to the SV-COMP benchmarks, we train on a subset of the dataset, only utilizing
20% of it. Like previous iterations of PeSCo, the network is trained to rank the
configurations in the order in which they should be executed. 

GRAVES-CPA uses the machine learning libraries PyTorch [13] and
PyTorch-Geometric [8], an extension of PyTorch for graphs and other irregularly
shaped data, to implement its machine learning components. GRAVES-CPA is
implemented using a combination of Python, C++, and Java.


2.3 Execution


Using the ordering produced by the previous step, CPAchecker is run in a
sequential fashion with each verification configuration. If a technique goes past a
given time limit or fails to produce a result, the next technique is executed.
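
The ranking and fallback behavior can be sketched as follows. The Java class OrderedExecution is hypothetical: each configuration is modeled as a Supplier that returns an empty Optional when it times out or fails, which is an assumption standing in for an actual CPAchecker run.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

final class OrderedExecution {
    // Indices of the configurations, highest predicted score first.
    static Integer[] order(double[] scores) {
        Integer[] idx = new Integer[scores.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> -scores[i]));
        return idx;
    }

    // Runs the configurations in predicted order and returns the first
    // verdict produced; an empty Optional models a timeout or failure.
    static Optional<String> run(double[] scores, List<Supplier<Optional<String>>> configs) {
        for (int i : order(scores)) {
            Optional<String> verdict = configs.get(i).get();
            if (verdict.isPresent()) return verdict;
        }
        return Optional.empty(); // every configuration timed out or failed
    }
}
```

A good ordering pays off because the loop stops at the first configuration that produces a verdict; a poor ordering spends the time budget on configurations that fail before a suitable one is reached.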
3 Strengths and Weaknesses


GRAVES-CPA operates on program graphs which are an abstraction of the pro- 
gram. Its underlying model uses this abstraction to learn what software patterns 
a particular verification technique excels at handling. This allows GRAVES-CPA
to produce a dynamic ordering which should run the techniques best suited to
the given problem first, reducing run time. In [11], the authors perform a
qualitative study which suggests the network learns to rank verification
techniques using program features an expert would use to decide between
techniques.

In SV-COMP 2022 [2], there were 4,548 problems for which both GRAVES-CPA and
CPAchecker reported the correct result. GRAVES-CPA’s dynamic selection, in
contrast to CPAchecker's static configuration ordering, allowed it to solve
these problems 37 hours faster. Further, GRAVES-CPA was able to solve 142
problems that CPAchecker could not, due to resource constraints or other issues.

Machine learning relies on the fact that training data is representative of the 
real world. If this is not the case, the model can easily make poor predictions. 
These poor decisions can be seen in competition in the 559 instances where 
GRAVES-CPA chooses an ordering that doesn’t produce the correct result, but 
CPAchecker does. In most of these instances, GRAVES-CPA runs out of resources 
or incorrectly predicts the remaining techniques will not produce a correct result. 


4 Tool Setup and Configuration


GRAVES-CPA is built on the PeSCo codebase, which in turn is built on the 
CPAchecker codebase, and participates in the ReachSafety and Overall cate- 
gories. It can be downloaded as a fork: https://github.com/will-leeson/cpachecker. 
GRAVES-CPA requires cmake, LLVM, either make or ninja, and ant (a CPAchecker
dependency) to be built and the Python libraries PyTorch and PyTorch-Geometric
to be executed. To build the project, run the shell script setup.sh and
add our graph generation tool, graph-builder, to your path. You can then
verify a program with GRAVES-CPA using the command:


scripts/cpa.sh -svcomp22-graves -spec [prop.prp] [file.c] 


5 Software Project and Contributions 


GRAVES-CPA is an open source project developed by the authors at the
University of Virginia. We would like to thank the team behind the PeSCo and
CPAchecker tools for allowing us to build on their work.
Abstract. GWIT is a validator for violation witnesses produced by
Java verifiers in the SV-COMP software verification competition. GWIT 
weaves assumptions documented in a witness into the source code of a 
program, effectively restricting the part of the program that is explored 
by a program analysis. It then uses the GDART tool (dynamic symbolic 
execution) to search for reachable errors in the modified program. 


1 Introduction 


Software verification tools, like any other software, can contain bugs. Given their 
intended use, i.e., proving the absence of errors in programs, however, bugs in 
verification tools are particularly problematic. On the other hand, verification 
tools can generate certificates for computed verdicts (e.g., counterexamples) that 
can be used to validate verification results. In the SV-COMP competition on
software verification, violation witnesses and correctness witnesses, based on
annotated abstract control-flow automata, have been established as a standardized
representation of such certificates [1,2]. Participating verifiers are expected to
produce witnesses for verdicts and witness validators are used for confirming 
verdicts based on these witnesses. 

In this paper, we present GWIT (as in “Guess What I'm Thinking” or as
in GDart-based witness validator), a validator of violation witnesses for Java
programs, based on the GDART tool ensemble [6]. GWIT validates violation
witnesses by weaving the assumptions documented in a witness into the
original program under analysis and checking the restricted program with dynamic
symbolic execution.


2 Witness Validation in GWIT 


We illustrate the operation of GWIT for the small example shown in Figure 1:
In the program, a String value is created nondeterministically, followed by an
assertion that this value is not “whoopsy”. This program contains a reachable
error: in case the value “whoopsy” is returned by the call to
Verifier.nondetString(), an assertion violation will be triggered.


* This work has been partially funded by an Amazon Research Award.


© The Author(s) 2022
D. Fisman and G. Rosu (Eds.): TACAS 2022, LNCS 13244, pp. 446-450, 2022. 
1 public static void main(String[] args) {
2     String s = Verifier.nondetString();
3     assert !s.equals("whoopsy");
4 }


Fig. 1: Small program with reachable error. 


Java verifiers will generate a violation witness in such a case. In SV-COMP, 
witnesses are produced in a standardized format, conceptually based on control- 
flow automata and technically realized as models in the GraphML format [2]. 
Figure 2 shows an excerpt of such a witness for the above example. The witness 
makes an assumption on the state of the program when executing line 2 of the 
example program, namely that variable s has value “whoopsy”. As discussed, 
execution paths on which this assumption holds will lead to an error.

GWIT weaves the assumptions from the witness into the original program, 
restricting the number of program paths that have to be explored for finding the 
error. Figure 3 shows the result for our example: a call to Witness.assume(...) 
is generated from the assumption from the witness in Figure 2. The assume 
method wraps potentially many calls to the Verifier.assume(...) method, 
enabling multiple assumptions on the same line of code (e.g., due to execution 
of that line in a loop). The counters array keeps statistics on assumptions per
line. The Verifier.assume(...) method is used by GDART to stop analysis 
on paths that violate the corresponding assumption. 

Figure 4, finally, shows the effect of weaving the witness into the code on the 
obtained constraints-trees. On the left of the figure, the tree computed by GDART
for the original program is shown. The tree has two satisfiable paths, branching 
on the condition of the assert statement. The right of the figure shows the tree 
for the modified program. This tree contains a node for the assumption, one path 
that is not executed after the violation of the assumption, one path that is not 
feasible after the assumption for the assert statement, and one path leading to an 
error (i.e., assertion violation). In this small example, the tree for the modified 
program is more complex than the tree for the original program, but it has fewer 
complete execution paths. In more complex programs, assumptions will typically 
remove multiple execution paths, making the validation task significantly easier 
than the original verification task. 


<edge source="n0" target="n1">
  <data key="originfile">Main.java</data>
  <data key="startline">2</data>
  <data key="threadId">0</data>
  <data key="assumption">s.equals("whoopsy")</data>
  <data key="assumption.scope">...</data>
</edge>


Fig. 2: Excerpt of violation witness produced by GDART or JBMC. 
static int[] counters = new int[] { 0 };

public static void assume(int id, boolean... assumptions) {
    int idx = counters[id];
    counters[id]++;
    Verifier.assume(assumptions[idx]);
}

public static void main(String[] args) {
    String s = Verifier.nondetString();
    Witness.assume(0, s.equals("whoopsy"));
    assert !s.equals("whoopsy");
}


Fig. 3: Program with assumption from witness weaved into the code. 


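[Figure 4: The left tree branches on s = "whoopsy" into an error leaf (err) and
an ok leaf. The right tree first branches on the assumption s = "whoopsy"
(false: assumption violation), then on the condition of the assert statement
(unsat and err).]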


Fig. 4: Constraints-tree for original program (left) and modified program (right). 


3 Performance and Limitations 


While the approach of GWIT is sound for violation witnesses, the current imple- 
mentation still has limitations, validating roughly half of the witnesses provided 
by verifiers. 


Soundness. GWIT is sound: weaving a witness into the code adds additional 
decision nodes to the constraints-tree. In the sub-tree rooted at such a new node, 
some paths become unsatisfiable and will not be explored. Every complete path 
$\psi$ in the modified tree has an equivalent path $\varphi$ in the original constraints-tree
such that $\psi \Rightarrow \varphi$. If an error is reached in the modified tree, it is also reachable
in the original program. 


Performance. For programs with few decisions, the modified program may
actually be more complex than the original program, but GDART explores
more paths than in the original program only in cases where the initial value
along some path does not satisfy an assumption. Comparing the CPU times of
GDART used as a verifier and used through GWIT, with almost identical
configuration options (the only difference: GWIT does not produce witnesses),
complexity is reduced for most benchmark instances that do not fail due to
syntactic errors during weaving (see below).

Two extreme examples are the BellmanFord-FunSat02 instance, for which weaving
a witness with 13 assumptions more than doubles CPU time, leading to a
timeout during validation, and the nanoxml_eqchk/prop2 instance, for which the
CPU time required for validation is less than 14% of the CPU time needed for
the original verification task.

Overall, GWIT successfully validates 301 of 614 witnesses provided by GDART 
and JBMC [3] (the only JAVA verifiers that currently produce witnesses). In 286 
cases, validation failed with inconclusive verdicts due to currently unsupported
features of witnesses. In 15 cases, incorrect weaving (see below) prevented
validation of witnesses. For 12 witnesses, validation attempts exhausted
resource limits.


Limitations. First, GWIT currently only supports violation witnesses. In
principle, it should be possible to validate correctness witnesses by weaving
assertions into the program code, but it is not obvious that such an approach
makes the validation of witnesses a simpler problem than the original
verification task. Second, since weaving witnesses is done on the source code,
it only works correctly on proper blocks, delimited with braces, and with one
statement per line. While this does not affect soundness, it makes the
validation of witnesses impossible in some cases.


4 Tool Setup 


GWIT is shipped as a git repository with sub-projects delivering all required 
components. Checking out the repository and initializing all sub-projects pulls 
in all required source code. For building the SPouT component, the mx build
system³ maintained by the GraalVM [7] team is required. Other components are
built with Maven. Once all build systems are available, the ./compile-all.sh
script builds GWIT. The ./run-gwit.sh script is used to validate witnesses, taking
the witness file and source folders of a benchmark instance as parameters. GWIT
currently does not expose any other configuration parameters.


5 Software Project 


The GWIT tool is available on GitHub⁴. GWIT’s scripts are licensed under the
Apache 2.0 license. The sub-projects bring their own licenses as follows: DSE⁵ is
available under the Apache 2.0 license, JCONSTRAINTS⁶ [4] as well, and SPOUT⁷
is available under the GPL v2 license. The components of GWIT and GWIT
itself are currently developed at TU Dortmund by the group led by Falk Howar.


3 https://github.com/graalvm/mx
4 https://github.com/tudo-aqua/gwit
5 https://github.com/tudo-aqua/dse
6 https://github.com/tudo-aqua/jconstraints
7 https://github.com/tudo-aqua/spout
6 Data Availability Statement


The GWIT archive used for SV-COMP 2022 is available at Zenodo [5].


References 


1. Beyer, D., Dangl, M., Dietsch, D., Heizmann, M.: Correctness witnesses:
   Exchanging verification results between verifiers. In: Proc. FSE. pp. 326–337.
   FSE 2016, Association for Computing Machinery, New York, NY, USA (2016).
   https://doi.org/10.1145/2950290.2950351

2. Beyer, D., Dangl, M., Dietsch, D., Heizmann, M., Stahlbauer, A.: Witness validation
   and stepwise testification across software verifiers. In: Proc. FSE. pp. 721–733.
   ESEC/FSE 2015, Association for Computing Machinery, New York, NY, USA (2015).
   https://doi.org/10.1145/2786805.2786867

3. Cordeiro, L., Kroening, D., Schrammel, P.: JBMC: Bounded model checking for
   Java bytecode. In: Beyer, D., Huisman, M., Kordon, F., Steffen, B. (eds.) Proc.
   TACAS. pp. 219–223. Springer International Publishing, Cham (2019).
   https://doi.org/10.1007/978-3-030-17502-3_17

4. Howar, F., Jabbour, F., Mues, M.: JConstraints: A library for working with logic
   expressions in Java. In: Models, Mindsets, Meta: The What, the How, and the Why
   Not?, pp. 310–325. Springer (2019). https://doi.org/10.1007/978-3-030-22348-9_19

5. Howar, F., Mues, M.: GWit artifact for SV-COMP 2022 (Feb 2022).
   https://doi.org/10.5281/zenodo.5956885

6. Mues, M., Howar, F.: GDart: An ensemble of tools for dynamic symbolic execution
   on the Java virtual machine (competition contribution). In: Proc. TACAS (2).
   Springer (2022)

7. Würthinger, T., Wimmer, C., Wöß, A., Stadler, L., Duboscq, G., Humer, C.,
   Richards, G., Simon, D., Wolczko, M.: One VM to rule them all. In: Proc. SPLASH.
   pp. 187–204 (2013)


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/),
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 

The images or other third party material in this chapter are included in the chapter’s 
Creative Commons license, unless indicated otherwise in a credit line to the material. If 
material is not included in the chapter's Creative Commons license and your intended 
use is not permitted by statutory regulation or exceeds the permitted use, you will need 
to obtain permission directly from the copyright holder. 


The Static Analyzer Infer in SV-COMP


(Competition Contribution) 


Matthias Kettl (✉) and Thomas Lemberger


LMU Munich, Germany 


Abstract. We present INFER-sv, a wrapper that adapts INFER for SV-COMP.
INFER is a static-analysis tool for C and other languages, developed
by Facebook and used by multiple large companies. It is strongly aimed
at industry and the internal use at Facebook. Despite its popularity, there
are no reported numbers on its precision and efficiency. With INFER-sv,
we take a first step towards an objective comparison of INFER with other
SV-COMP participants from academia and industry.


1 Facebook Infer 


INFER [6] is a compositional and incremental static-analysis tool developed at 
Facebook. INFER supports a wide array of analyses; this includes memory safety, 
buffer overruns, performance constraints and different reachability analyses for 
C, C++, Objective C, Java, C#, and .Net. For memory analysis, INFER uses 
bi-abduction [7] with separation logic [14]. INFER supports the integration of 
new abstract domains through the abstract-interpretation framework Infer:AI.
INFER analyzes programs compositionally (building method summaries) and 
incrementally (only analyzing changed program parts). In contrast to most other 
tools that participate in SV-COMP, INFER is not an academic verifier. Instead, it is 
aimed at practical use during software development. This has direct implications 
on the development focus: When INFER is told to incrementally analyze software,
it outputs only newly discovered bugs and does not re-report bugs found in 
previous analyses. This allows developers to ignore warnings not deemed relevant 
and reduces the cognitive burden on developers due to false alarms. Multiple 
large companies use INFER—among others: Amazon Web Services, Facebook, 
Microsoft, Mozilla, and Spotify. At the time of this writing, INFER has more than 
12000 stars on GitHub and was forked over 1500 times. Despite its popularity, 
there are no reported numbers on INFER's precision and soundness. With the
participation of INFER in the C language track of SV-COMP ’22, we hope to take
a first step towards an objective comparison of INFER with other verifiers. 

The following other commercial verifiers participate in SV-COMP ’22: 2LS [16],
CBMC [10], Crux¹, Frama-C [5], VeriAbs [12], and VeriFuzz [9].


1 https://crux.galois.com/ 


© The Author(s) 2022 
2 Infer in SV-COMP
2.1 Infer-SV 


Verification. We provide the wrapper INFER-sv to adapt INFER to the SV-COMP
specification format for program properties. INFER-sv parses the property to 
analyze, adjusts the program under analysis for INFER, runs INFER with fitting 
analyses, and reports a verification verdict based on the feedback produced by 
INFER. INFER-sv supports the following SV-COMP program properties: 


no-overflow. The aim is to check for arithmetic overflows on signed-integer types. 
INFER-sv runs INFER's buffer-overrun analysis² to detect these.


unreach-call. The aim is to check for reachable calls to function reach_error.
INFER provides a function-call reachability analysis³, but this analysis proved very
imprecise. To mitigate this, INFER-sv performs a program transformation⁴: It
replaces each call to function reach_error with an overflow-provoking statement
int reach_error_x = 0x7fffffff + 1. No task with property unreach-call
contains a signed-integer overflow, so the original reachability property holds if
and only if none of the introduced overflows is reachable. INFER-sv runs INFER’s
buffer-overrun analysis on the transformed program to check this.
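
The transformation can be illustrated with a purely textual rewrite. The Java sketch below is not INFER-sv's implementation: the class ReachErrorRewriter is hypothetical, the regular expression assumes that calls appear as plain statements of the form reach_error(); and that the source contains no prototype of reach_error, and wrapping the overflow statement in its own block (so that several calls in one scope do not redeclare the variable) is an additional assumption.

```java
import java.util.regex.Pattern;

final class ReachErrorRewriter {
    // Matches a call statement such as "reach_error();" but not the function
    // definition "void reach_error() { ... }", which is not followed by ';'.
    private static final Pattern CALL =
            Pattern.compile("\\breach_error\\s*\\(\\s*\\)\\s*;");

    // Overflow-provoking replacement; the enclosing braces are an assumption
    // of this sketch to keep repeated replacements in one scope compilable.
    static final String OVERFLOW = "{ int reach_error_x = 0x7fffffff + 1; }";

    // Returns the C source with every matched call replaced by the overflow.
    static String rewrite(String cSource) {
        return CALL.matcher(cSource).replaceAll(OVERFLOW);
    }
}
```

After the rewrite, an overflow checker that reports an alarm on the introduced statement indicates a reachable call in the original program. A robust implementation would work on a parsed program rather than on text, since comments, string literals, and prototypes can defeat a regular expression.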


valid-memsafety. The aim is to check for invalid pointer dereferences, invalid 
frees of memory, and memory leaks. To analyze memory safety, INFER-sv uses 
two analyses: bi-abduction⁵ and Infer:Pulse⁶. SV-COMP requires verifiers to
report the concrete type of violation detected: valid-deref, valid-memtrack, or 
valid-free. INFER-sv analyzes the error codes reported by INFER to determine the 
exact violation found. If INFER reports multiple fitting warnings, we take the first. 


Witnesses. SV-COMP requires participants to report GraphML verification-result 
witnesses [3, 4] in tandem with each result, and these witnesses must be 
successfully validated by at least one participating witness validator. Natively, 
INFER does not support the generation of GraphML witnesses. To mitigate this, 
INFER-SV creates generic witnesses: When reporting a violation, it generates a 
violation witness [4] that represents all possible program paths. When reporting a 
program safe, it generates a correctness witness [3] that only contains the trivial 
invariant ‘true’. These witnesses do not helpfully guide towards a violation or 
proof, but are valid according to the SV-COMP rules. 
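
A trivial correctness witness of this kind can be sketched in Python as follows. This is a simplified illustration, not the code of INFER-SV: the function name is hypothetical, and the key names witness-type, entry, and invariant follow our reading of the witness format [3, 4] and should be checked against it. A witness accepted by a validator needs further metadata (for example the specification and a hash of the program file) that is omitted here.

```python
import xml.etree.ElementTree as ET


def trivial_correctness_witness():
    """Build a minimal GraphML document with one node and invariant 'true'."""
    root = ET.Element("graphml")
    graph = ET.SubElement(root, "graph", edgedefault="directed")

    # Graph-level metadata (assumed key name; real witnesses carry more).
    witness_type = ET.SubElement(graph, "data", key="witness-type")
    witness_type.text = "correctness_witness"

    # A single entry node whose invariant is the trivial invariant 'true'.
    node = ET.SubElement(graph, "node", id="N0")
    entry = ET.SubElement(node, "data", key="entry")
    entry.text = "true"
    invariant = ET.SubElement(node, "data", key="invariant")
    invariant.text = "true"

    return ET.tostring(root, encoding="unicode")


print(trivial_correctness_witness())
```

Such a witness carries no information beyond the verdict itself, which matches the remark above that generic witnesses do not guide a validator towards a proof.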


Participation. INFER-SV participates hors concours in the categories ReachSafety, 
ConcurrencySafety, NoOverflows, and SoftwareSystems. Because of missing support, 
we exclude INFER-SV from categories aimed at float handling, as well as 
category MemSafety-MemCleanup. 


2 https://fbinfer.com/docs/checker-bufferoverrun 
3 https://fbinfer.com/docs/checker-annotation-reachability 
4 https://github.com/facebook/infer/issues/763 
5 https://fbinfer.com/docs/checker-biabduction 
6 https://fbinfer.com/docs/checker-pulse 
Fig. 1: Comparison of the run time (in CPU time seconds) of three SV-COMP ’22 
medalists and INFER, across all tasks correctly solved by the respective pair 


    1 int main() {
    2   if (0) {
    3     int x = 0x7fffffff + 1;
    4   }
    5 }
(a) INFER correctly reports safety 

    1 void reach_error() {
    2   int x = 0x7fffffff + 1;
    3 }
    4 int main() {
    5   if (0) {
    6     reach_error();
    7   }
    8 }
(b) INFER incorrectly reports an alarm 

    1 int main() {
    2   int x = 0x7fffffff;
    3   int y = -1;
    4   while (x > 0) {
    5     x = x - 2*y;
    6   }
    7 }
(c) INFER correctly reports an alarm 

    1 int main() {
    2   int x = 0x7fffffff;
    3   int y = -1;
    4   while (x > 0) {
    5     x = x - 2*y;
    6     y = y + 2;
    7   }
    8 }
(d) INFER incorrectly reports safety 


Fig. 2: Examples of INFER’s inconsistent results 


2.2 Strengths of Infer 


INFER scales well [6]. This shows in the SV-COMP results: For 6000 out of 
8000 tasks with a verification verdict, INFER finishes the analysis in less than 
one second of CPU time. The remaining 2000 tasks each take less than 100s 
of CPU time. This means that INFER stays significantly below the time limit of 
900s per task. Figure 1 compares the run time of INFER (in CPU-time seconds) 
to the best SV-COMP ’22 tools in the categories that INFER participated in: 
CPACHECKER [11], SYMBIOTIC [8], and VERIABS [12]. Each plot shows the 
run time for all tasks that are correctly solved by both INFER and the respective 
other verifier (independent of result validation). It is visible that INFER (y-axis) is 
significantly faster than the other tools (x-axis) for almost all tasks. This speed 
makes INFER integrate well in continuous-integration development systems [13, 15]. 
2.3 Weaknesses of Infer 


INFER demonstrates low analysis precision. Figures 2a and 2b illustrate a low 
precision across function calls (intraprocedural analysis): Both programs contain 
an unreachable, signed integer overflow. The only difference is the indirection in 
Fig. 2b due to the additional function call. INFER correctly reports Fig. 2a safe, 
but incorrectly reports an alarm for Fig. 2b. We assume that the intraprocedural 
analysis of INFER does not check whether reach_error is reachable from the 
program entry. INFER-SV mitigates this issue for property unreach-call through 
the mentioned program transformation, but this imprecision still leads INFER to 
report wrong alarms across all program properties. 

INFER can also show imprecision within a single function. Consider Figs. 2c 
and 2d: The only change between Fig. 2c and Fig. 2d is the addition of a 
statement in line 6, y = y + 2. This has no influence on the integer overflow in 
line 5, so both programs contain an overflow. INFER correctly reports the overflow 
for Fig. 2c, but wrongly reports Fig. 2d safe. 
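
The overflow in line 5 can be checked by hand: with x = 0x7fffffff and y = -1, the first evaluation of x - 2*y yields 2147483647 + 2, which exceeds the largest 32-bit signed integer. The following Python sketch replays both loops with unbounded integers. It assumes a 32-bit int, and the function name and the update_y switch are ours, introduced only for this illustration.

```python
INT_MAX = 2**31 - 1   # assumption: int is 32 bits wide
INT_MIN = -2**31


def first_overflowing_iteration(update_y, max_iters=10):
    """Return the loop iteration in which line 5 overflows, or None."""
    x, y = 0x7fffffff, -1
    for iteration in range(1, max_iters + 1):
        if not x > 0:
            return None
        result = x - 2 * y          # line 5, computed without wrap-around
        if not INT_MIN <= result <= INT_MAX:
            return iteration
        x = result
        if update_y:
            y = y + 2               # line 6, present only in Fig. 2d
    return None


print(first_overflowing_iteration(update_y=False))  # program of Fig. 2c
print(first_overflowing_iteration(update_y=True))   # program of Fig. 2d
```

In both variants the overflow occurs before line 6 is ever executed, which is why the added statement cannot affect the verdict.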

These imprecisions are strongly reflected in the SV-COMP results of INFER, leading 
to many incorrect proofs and alarms. 


3 Usage 


INFER-SV requires Python 3.6 or later. The script setup.sh downloads and extracts 
version 1.1.0 of INFER. From the tool’s directory, INFER-SV can be run with the 
following command: 


./infer-wrapper.py \ 
--data-model {ILP32 or LP64} \ 
--property path/to/property.prp \ 
--program path/to/program.c 


Setting the data model is optional. INFER-SV will print the recognized property 
and the command line it uses to call INFER. INFER-SV prints the full output of 
INFER, including all warnings, and the final verification verdict on the last line. 
The verification verdict can be true, false, unknown or error. 


4 Conclusion 


The participation of INFER in SV-COMP allows an objective comparison with 
other verifiers for C. This shows that the selected analyses of INFER are very 
efficient, but suffer from strong imprecision on the considered benchmark tasks. 

Contributors. INFER⁷ is developed by Facebook and the open-source community 
under the MIT license, and INFER-SV⁸ is developed under the Apache 2.0 license at 
the Software and Computational Systems Lab at LMU Munich, led by Dirk Beyer. 


7 https://github.com/facebook/infer 
8 https://gitlab.com/sosy-lab/software/infer-sv 
Abstract. LART (LLVM abstraction and refinement tool) originates 
from the DIVINE model-checker [5, 7], in which it was employed as an 
abstraction toolchain for the LLVM interpreter. In this contribution, we 
present a stand-alone tool that does not need a verification backend but 
performs the verification natively. The core idea is to instrument abstract 
semantics directly into the program and compile it into a native binary 
that performs program analysis. This approach provides a performance 
gain of native execution over the interpreted analysis and allows compiler 
optimizations to be employed on abstracted code, further extending the 
analysis efficiency. Compilation-based abstraction introduces new challenges 
solved by LART, such as the domain interaction of concrete and abstract 
values, simulation of nondeterministic runtime, and constraint propagation. 


Keywords: Abstract interpretation · Compilation-based abstraction · 
LLVM · LART · DIVINE · Formal verification · Symbolic execution. 


1 Verification Approach and Software Architecture 


As with many tasks in computer science, verification can be approached in 
multiple ways. In general, tools approach program analysis using 
interpretation, which gives them complete control over the program state 
and program execution at the cost of performance. Our tool LART 
approaches the task from the opposite side of the spectrum, compilation, 
using a technique called compilation-based abstraction. The 
main idea of this approach is to compile nondeterministic execution directly 
into the executable and perform reachability analysis by its native execution. 
This approach is most similar to the one presented in SymCC [6]. SymCC 
compiles symbolic execution into the native binary. In contrast, we present 
a more general approach that allows arbitrary abstraction. The Spin model 
checker [4] also provides a mode where the model is compiled together with a 
verifier into a single executable. 

During the compilation, LART performs an LLVM-to-LLVM transformation to 
augment instructions that can manipulate nondeterministic values. This is 
a purely syntactic abstraction of a program, e.g., an add instruction is replaced 
by a call to _lart_add. Additionally, LART provides a set of semantic libraries 
(abstract domains) to give meaning to abstract instructions. Each abstract 
domain defines the native representation of abstract values and implements 
abstract instructions and transformations to and from concrete values and other 
domains. The tool provides multiple domains that allow analyses with various 
precisions, e.g., interval analysis, nullity analysis, or symbolic analysis. Finally, 
to allow native execution, domains are provided as static libraries linked to 
instrumented programs under test. 


* This work has been partially supported by Red Hat, Inc. 
** Jury member representing LART at SV-COMP 2022. 

In comparison to concrete programs, abstracted programs also exhibit 
nondeterministic control flow. To explore all possible execution paths, LART 
provides a configurable runtime library. The overall architecture of 
compilation-based abstraction is depicted in Figure 1. 

The configuration used in the competition contribution employs an iterative 
deepening search of program paths. At each branching point of a program, 
the execution forks to explore all possibilities. Finally, the main process of the 
analysis gathers results from explored paths and notifies the user if an error 
is reachable. This approach can suffer from infinite loops and the 
path explosion problem. However, it is sufficient for bug hunting or even 
verification in the case of an overapproximative abstraction, which widens the 
effect of infinite loops. Also, in many simple cases, a compiler can summarize 
the effects of program loops, minimizing the impact of path explosion. 
Fig. 1. LART architecture overview. 


In order to obtain a performant result, we strive to minimize the amount 
of syntactic abstraction. Instrumentation achieves this by combining forward 
dataflow analysis and Andersen alias analysis [1], tainting only those instructions 
that might encounter nondeterministic values, and abstracting only the tainted 
instructions. This analysis is entirely overapproximative and detects all possible 
candidates for abstraction quickly. The actual abstract computation is resolved 
later during execution. 

However, we do not want to perform expensive abstract computation when 
tainted instructions do not obtain nondeterministic operands. This might occur 
when a C function receives concrete arguments at one call site and abstract 
arguments at another. In the former case, we would like to execute the function 
fully concretely, while in the latter, we want to execute only the necessary amount 
of tainted instructions abstractly. Therefore, LART synthesizes simple dispatch 
routines that pick a concrete or abstract instruction depending on the operands. 
The dispatch routine also handles the possibility of mixing concrete and abstract 
operands by lifting concrete values to an abstract domain if necessary. We require 
that all operands of abstract instructions are in the same domain. See an example 
of dispatch in Figure 2. 


    _lart_value _lart_dispatch_add(_lart_value a, _lart_value b) {
        if (is_abstract(a) || is_abstract(b)) {
            if (!is_abstract(a))
                a.abstract = lift(a.concrete);
            else if (!is_abstract(b))
                b.abstract = lift(b.concrete);
            return domain::add(a.abstract, b.abstract);
        }
        return a.concrete + b.concrete;
    }


Fig. 2. Syntactically abstracted values in LART are represented as a union of an 
abstract and a concrete type (_lart_value). The dispatch routine lifts operands to an 
abstract domain and resolves in which domain the instruction should be executed. 
Since the abstraction dispatch is purely syntactic, it can be inlined into the 
abstracted source code and further optimized. This gives the compiler a possibility 
to optimize repeated checks in dispatch routines. 
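
To make the lifting step concrete, the dispatch can be modelled in a few lines of Python with an interval domain standing in for domain::add. This is only an illustrative model under our own assumptions: LART operates on LLVM and native libraries, and the names Interval, lift, and dispatch_add are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Interval:
    """Abstract value: all integers between lo and hi (inclusive)."""
    lo: int
    hi: int


Value = Union[int, Interval]  # concrete or abstract, like the union in Fig. 2


def lift(concrete: int) -> Interval:
    """Lift a concrete value to the singleton interval that contains it."""
    return Interval(concrete, concrete)


def dispatch_add(a: Value, b: Value) -> Value:
    """Add concretely if possible, otherwise lift and add in the domain."""
    if isinstance(a, Interval) or isinstance(b, Interval):
        if not isinstance(a, Interval):
            a = lift(a)
        if not isinstance(b, Interval):
            b = lift(b)
        # Interval addition: add the bounds component-wise.
        return Interval(a.lo + b.lo, a.hi + b.hi)
    return a + b


print(dispatch_add(2, 3))
print(dispatch_add(2, Interval(0, 5)))
```

The concrete path never touches the domain, which mirrors the motivation given above: abstract computation is paid for only when an operand is actually abstract.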


The runtime for native execution takes care of multiple responsibilities. First 
of all, it implements an execution fork when a branch is conditioned on an 
abstract value for which both outcomes are possible, e.g., when a branch is 
conditioned on the symbolic term x < 5. Furthermore, the runtime 
takes care of memory management of abstraction. To not disrupt the original 
program’s memory layout, LART keeps all abstract data in a shadow memory. 
Therefore, the union values presented in Figure 2 are split into two separately 
addressed regions: concrete program memory and abstract shadow memory. 
The information on whether variables hold an abstract value is also kept in the 
shadow memory. 


2 Strengths and Weaknesses 


The main strength of the compilation-based abstraction is in the utilization of 
native runtime and compiler optimizations on abstracted code. In theory, 
native execution should consistently outperform the same interpreted analysis. 
However, it comes at the cost of a more complex source transformation that is 
harder to relate to its origin. Furthermore, the overapproximative nature of the 
syntactic analysis causes dispatch functions to be executed even when they are 
not needed. In contrast, an interpreter can compute in a specific domain without 
additional dispatches. Another advantage of the approach is a reusable result 
of syntactic abstraction that can be linked with various domains to perform 
analysis concurrently without repeated LLVM instrumentation. 
The best comparison of LART is with the DIVINE model-checker, which uses 
LART’s transformation and domain libraries internally, but instead of compiling 
to a native executable, it interprets abstracted LLVM IR. Results from the 
competition support the hypothesis that the compilation-based approach of LART 
outperforms DIVINE in all reachability subcategories, except one where longer 
times are caused by a different state space exploration order. 

Given the simplistic runtime, abstracted binaries produced by LART lack 
further analysis optimizations and verification capabilities. Presently, the 
exploration algorithm only supports reachability analysis of single-threaded 
programs. However, we plan to support memory safety and overflow checking 
using a sanitizer-like approach. 

Another goal of LART’s compilation-based approach is to provide a reusable 
abstraction component for verification tools. The proof of this concept is shown 
on DIVINE and now on the native mode, which can be analyzed by a standard 
programmer toolset, such as debuggers or sanitizers. 


3 Tool Setup and Configuration 


The verifier archive can be found on the SV-COMP 2022 [2] page under the name 
LART. In case the binary distribution does not work on your system, we also 
provide a source distribution and build instructions at 
https://github.com/xlauko/lart/tree/svcomp-2022. It is sufficient to run LART 
using the compiler wrapper script as follows: 
lartcc <domain> testcase.c -o abstract and then execute the 
abstract binary to perform the analysis. 

For the SV-COMP contribution, the LART wrapper handles additional settings and 
the setup of the workflow presented in Figure 1. The wrapper sets LART options 
based on the property file and the benchmark. In particular, LART enables symbolic 
mode if any nondeterminism is found, and it sets which errors should be reported 
based on the property file. It also generates witness files. More details can be 
found on the aforementioned distribution page. Due to limited support, LART 
participates only in the ReachSafety and DeviceDrivers categories. 


4 Software Project and Contributors 


The project home page is https://github.com/xlauko/lart. LART is open-source 
software distributed under the MIT license. Active contributors to the 
tool are listed as authors of this paper. 


Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. 
The version of our verifier as used in the competition is archived together with other 
participating tools [3]. 
Abstract. The development of SYMBIOTIC 9 focused mainly on two 
components. One is the symbolic executor SLOWBEAST, which newly 
supports backward symbolic execution including its extension called loop 
folding. This technique can infer inductive invariants from backward 
symbolic execution states. Thanks to these invariants, SYMBIOTIC 9 is able 
to produce non-trivial correctness witnesses, which is a feature that was 
missing in previous versions of SYMBIOTIC. We have also extended forward 
symbolic execution in SLOWBEAST with basic support for parallel 
programs. The second component with significant improvements is 
the instrumentation module. In particular, we have extended the static 
analysis of accesses to arrays with features designed for programs that 
manipulate C strings. 

SYMBIOTIC 9 is the Overall winner of SV-COMP 2022. Moreover, it also won 
the categories MemSafety and SoftwareSystems, and placed third in 
Falsification Overall. 


1 Verification Approach 


SYMBIOTIC 9 combines fast static analyses with code instrumentation and pro- 
gram slicing [13] to speed up the code verification. In the SV-COMP configura- 
tion of SYMBIOTIC 9, the code verification is performed by symbolic executors, 
namely by SLOWBEAST [8] and our fork of KLEE [4]. 

As SYMBIOTIC works internally with LLVM [10], it first compiles the given C 
program into LLVM bitcode. The following steps depend on the verified property. 


Verification of the Property unreach-call. For this property, SYMBIOTIC 9 
directly slices the LLVM bitcode to remove instructions that have no influence 
on the reachability of error calls and then runs KLEE with a time limit of 
333 seconds. KLEE is very efficient and often decides the task within this time 
limit. If KLEE fails to decide, we parse its output and proceed according to the 
cause of the failure. If KLEE failed because the program contains threads, we 

* This work has been supported by the Czech Science Foundation grant GA19-24397S. 
** Jury member and the corresponding author: chalupa@fi.muni.cz 


© The Author(s) 2022 
Table 1. The comparison of supported features of KLEE (our fork and the upstream) 
and SLOWBEAST (SV-COMP 2022 and SV-COMP 2021 versions). The marks v /// /X 
mean supported/partially supported/unsupported. 


KLEE KLEE SLOWBEAST SLOWBEAST 
upstream our fork | SV-COMP 2021 SV-COMP 2022 
Backward SE x x X v 
Loop folding x x V 
Invariant generation x x V v 
Symbolic floats x x v v 
Symbolic pointers V y X 
Symbolic-sized allocations x y X 
Symbolic addresses X v x 
Parallel programs x x x 
Incremental solving x x v 
Caching solver calls V 4 x x 
Lazy memory x x v v 


run SLOWBEAST with forward symbolic execution (SE) and the threads support 
turned on. If KLEE failed for any other reason, we run SLOWBEAST with backward 
symbolic execution with loop folding (BSELF) [8] described later. If BSELF also 
fails (the current implementation supports only selected program features), we 
run SLOWBEAST with forward symbolic execution. 
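
The fallback order described above can be summarized in a short Python sketch. It is not taken from the SYMBIOTIC control scripts: the callables run_klee and run_slowbeast, the mode names, and the failure label "threads" are placeholders chosen for this example, and the replay and witness-generation steps are left out.

```python
def verify_unreach_call(run_klee, run_slowbeast):
    """Return 'true', 'false', or 'unknown' following the described order.

    run_klee(timeout_s) -> (verdict or None, failure reason or None)
    run_slowbeast(mode) -> verdict or None
    Both callables are hypothetical stand-ins for the real tool invocations.
    """
    verdict, failure = run_klee(timeout_s=333)
    if verdict is not None:
        return verdict

    if failure == "threads":
        # KLEE gave up because of threads: forward SE with threads support.
        modes = ["forward-threads"]
    else:
        # Any other failure: try BSELF first, then plain forward SE.
        modes = ["bself", "forward"]

    for mode in modes:
        verdict = run_slowbeast(mode)
        if verdict is not None:
            return verdict
    return "unknown"


# Example with stub tools: KLEE times out, BSELF fails, forward SE proves safety.
print(verify_unreach_call(
    lambda timeout_s: (None, "timeout"),
    lambda mode: "true" if mode == "forward" else None,
))
```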

Note that running forward symbolic execution first with KLEE and then with 
SLOWBEAST if KLEE fails makes good sense as KLEE and SLOWBEAST support 
a different set of features. The main differences between these tools (and 
the upstream KLEE and the version of SLOWBEAST used in SYMBIOTIC 8) are 
summarized in Table 1. Row symbolic addresses indicates whether tools model 
the non-determinism in the placement of allocated objects (this is useful, e.g., 
when comparing addresses of such objects). Row incremental solving indicates 
whether tools can associate the state of an SMT solver to every symbolic 
execution state and incrementally add constraints instead of always solving 
formulas from scratch. Row caching solver calls indicates whether tools can 
remember results of solver calls and use them later to quickly decide some other 
solver calls. Finally, row lazy memory indicates if the tool can create memory 
objects on-demand when first accessing them, without their previous allocation 
(it assumes that the accesses to memory are valid). This feature is crucial when we 
want to execute a program by parts, without starting from the entry point. The 
meaning of the remaining rows should be clear or is explained later. 

If an error is found by either tool, it is replayed on the unsliced code. If the 
replay succeeds, we generate a violation witness. If no error is found and the anal- 
ysis was complete, we generate a correctness witness. If the program correctness 
was proved by SLOWBEAST with BSELF, we generate a witness containing the 
computed invariants, otherwise we generate a trivial correctness witness as we 
have no invariants at hand. In all other cases, SYMBIOTIC 9 answers unknown. 
Verification of Other Properties. For verification of properties other than 
unreach-call, SYMBIOTIC 9 uses the same workflow as SYMBIOTIC 8 [7]. In 
brief, the instrumentation module marks program instructions that can 
potentially violate the considered property. The module employs suitable fast 
static analyses to identify these instructions (e.g., when checking the property 
no-overflow, it uses a range analysis to discover the instructions that may 
perform a signed integer overflow). The bitcode with marked instructions is sliced 
such that the arguments and the reachability of these instructions are preserved. 
The sliced bitcode is passed to KLEE. If KLEE discovers a property violation and 
successfully replays it on the unsliced code, we produce a violation witness. If KLEE 
completes its analysis without any property violation found, we produce a trivial 
correctness witness. In all other cases, SYMBIOTIC 9 returns unknown. 


Backward Symbolic Execution with Loop Folding (BSELF) [8]. SLOWBEAST 
newly implements backward symbolic execution (BSE) [9], which explores 
the program backward from target locations towards the initial location and 
incrementally computes weakest preconditions for the explored program paths. 
BSE is a valuable technique on its own as it precisely corresponds to k-induction 
on control-flow paths [8]. Loop folding is a technique that aims to infer 
inductive invariants during BSE. Roughly speaking, when BSE starts from an error 
location and reaches a loop header, loop folding creates an initial invariant 
candidate that is disjoint from the current weakest precondition (i.e., the states 
that can reach the error location). If the invariant candidate is actually an 
invariant, we know that the error location is not reachable via the explored path. 
Otherwise, a pre-image of the invariant candidate along a loop path is computed, 
over-approximated, and added to the candidate. This process is repeated until an 
invariant is found or until it fails for some reason, e.g., when it discovers that the 
error location is actually reachable. Loop folding can infer complex disjunctive 
invariants and since it uses the error states, it is also property-driven. 


String Analysis and Other Improvements. The second major improvement 
in SYMBIOTIC 9 is in the instrumentation for the property valid-memsafety. We 
have improved the analysis for the identification of out-of-bounds array accesses. 

In Symbiotic 8, this analysis only determined whether an array access done 
via the index variable is in bounds [14]. The analysis in Symbiotic 9 also handles 
more general patterns where the array contains a concrete value (0 in the case 
of C strings) and the index pointer is incremented by one until it points to this 
concrete value, and where the pointer is incremented a fixed number of times. 

Further, we have extended the forward symbolic execution in SLOWBEAST to 
handle parallel programs. For now, the symbolic execution is highly inefficient 
as it examines each interleaving of globally visible events. We plan to implement 
some reductions in the future. SLOWBEAST has also been extended to generate 
witnesses as this functionality was missing. Notably, it can generate non-trivial 
correctness witnesses using the invariants computed by BSELF. Previous 
versions of SYMBIOTIC generate only trivial correctness witnesses. 
Slicing has also been improved. It now applies a fast and coarse slicing 
before the main slicing. The coarse slicing detects all basic blocks from which no 
slicing criterion (i.e., an instruction whose reachability and arguments should 
be preserved) is syntactically reachable and replaces them by calls to abort. 
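
The coarse slicing step amounts to a backward reachability computation on the control-flow graph. The sketch below shows the idea in Python on a graph given as a dictionary. The representation and the name blocks_to_abort are assumptions made for this illustration, not the data structures of the SYMBIOTIC slicer, which works on LLVM bitcode.

```python
def blocks_to_abort(cfg, criteria_blocks):
    """Return the blocks from which no block with a slicing criterion is reachable.

    cfg maps each basic block to the list of its successors;
    criteria_blocks is the set of blocks that contain a slicing criterion.
    """
    # Build predecessor sets so that the graph can be traversed backward.
    predecessors = {block: set() for block in cfg}
    for block, successors in cfg.items():
        for successor in successors:
            predecessors.setdefault(successor, set()).add(block)

    # Backward search from the criteria: everything visited can reach one.
    can_reach = set(criteria_blocks)
    worklist = list(criteria_blocks)
    while worklist:
        block = worklist.pop()
        for pred in predecessors.get(block, ()):
            if pred not in can_reach:
                can_reach.add(pred)
                worklist.append(pred)

    # The remaining blocks would be replaced by calls to abort.
    return set(predecessors) - can_reach


cfg = {"entry": ["a", "b"], "a": ["crit"], "b": ["dead"], "crit": [], "dead": []}
print(sorted(blocks_to_abort(cfg, {"crit"})))
```

Because the search only follows edges of the graph and ignores path feasibility, it is cheap, which fits the description of this pass as fast and coarse.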


2 Strengths and Weaknesses 


Forward symbolic execution is unable to fully analyze unbounded loops or in- 
finite execution paths. Hence, unless program slicing removes the unbounded 
computation from the program, forward symbolic execution cannot verify it. 
However, backward symbolic execution and BSELF can fully analyze at least 
some unbounded programs [8]. Still, both these methods are computationally 
complex as the number of paths they must search may be enormous and their 
exploration may involve many non-trivial calls to the SMT solver. Therefore, 
these methods do not scale to real-world programs. 

A strong aspect of SYMBIOTIC is the interplay of fast static analyses 
in the instrumentation, program slicing, and forward and backward symbolic 
execution. Fast static analyses are able to deem correct many parts of the code 
(with respect to the verified property). These parts of the code are then 
usually removed by slicing and only the possibly unsafe parts of the program (and 
their dependencies) get into a symbolic executor. In this sense, SYMBIOTIC does 
incremental or conditional [3] verification. 


Results of Symbiotic 9 in SV-COMP 2022. In SV-COMP 2022 [1], SYMBIOTIC 9 
won the categories MemSafety, SoftwareSystems, and Overall, and took 
3rd place in FalsificationOverall. Moreover, it produced 1529 correct answers 
that were not confirmed, which is the highest number in SV-COMP 2022. 1073 
unconfirmed answers are in MemSafety-Juliet, where we produced some incorrect 
witnesses due to a bug. Another 258 unconfirmed answers are in Termination. 
SYMBIOTIC 9 produced only 3 incorrect answers caused by a bug in the replay 
mode of SLOWBEAST. 


3 Software Project and Contributors 


All components of SYMBIOTIC 9 use LLVM 10 [10]. Slicer and instrumentation 
module are written in C++ and extensively use the library DG [5]. KLEE is 
implemented in C++ and SLOWBEAST [12] is written in Python. Both symbolic 
executors use Z3 [11] as the SMT solver. Control scripts are written in Python. 

SYMBIOTIC 9 and all its components and external libraries are available under 
open-source licenses that comply with SV-COMP’s policy for the reproduction 
of results. SYMBIOTIC 9 participated in all categories of SV-COMP 2022 except 
the categories with Java programs. 

SYMBIOTIC 9 has been developed by Marek Chalupa, Vincent Mihalkovič, 
Anna Řechtáčková, and Lukáš Zaoral under the supervision of Jan Strejček. 
Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [1] and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. 
The version of SYMBIOTIC used in the competition is archived together with other 
participating tools [2] and also in its own artifact [6] at Zenodo. 
Abstract. SYMBIOTIC-WITCH is a new tool for checking violation 
witnesses in the GraphML-based format used at SV-COMP since 2015. 
Roughly speaking, SYMBIOTIC-WITCH symbolically executes a given 
program with KLEE and simultaneously tracks the set of nodes the witness 
automaton can be in. Moreover, it reads the return values of 
nondeterministic functions specified in the witness and uses them to prune the 
symbolic execution. The violation witness is confirmed if the symbolic 
execution reaches an error and the current set of witness nodes contains 
a matching violation node. 

SYMBIOTIC-WITCH currently supports violation witnesses of reachability 
safety, memory safety, memory cleanup, and overflow properties. 


1 Verification Approach 


We present a new checker of violation witnesses called SYMBIOTIC-WITCH. The 
checker first loads a given violation witness in the GraphML format [5] and a 
given program. Then it performs symbolic execution [11] of the program and 
simultaneously tracks the progress of the execution in the witness automaton. 
More precisely, every state of symbolic execution is accompanied by the set of 
witness automaton nodes that can be reached under the executed program path. 
If the symbolic execution detects a violation of the considered property and the 
tracked set of witness automaton nodes contains a violation node, the witness is 
confirmed. 

Note that the original description of the witness format [5] does not provide 
any formal semantics of the format. We interpret it in the way that if an edge 
in a witness automaton matches an executed program instruction, then we can 
follow the edge but we can also stay in its starting node. Hence, if we have the 
set of witness automaton nodes reached under a certain program path, then 
an extension of this path can add some nodes to this set, but it never removes 
any node from the set. A brief reading of an upcoming detailed description of 
the format [4] reveals that it can be the case that an edge matching an executed 
program instruction has to be taken. If this is indeed the case, we will adjust 
our tool, but the current implementation and the following text consider the 
former semantics. 
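
Under this interpretation, tracking the witness automaton reduces to a monotone update of a node set. The following Python sketch illustrates it; it is not the code of SYMBIOTIC-WITCH (which is a modification of KLEE), and the representation of edges as pairs of a matching predicate and a target node, as well as the names advance and confirmed, are assumptions of this example.

```python
def advance(nodes, edges, event):
    """Update the set of witness nodes after one executed instruction.

    edges maps a node to a list of (matches, target) pairs, where
    matches(event) tells whether the edge matches the executed instruction.
    Staying in a node is always allowed, so the set never shrinks.
    """
    reached = set(nodes)
    for node in nodes:
        for matches, target in edges.get(node, ()):
            if matches(event):
                reached.add(target)
    return reached


def confirmed(nodes, violation_nodes, error_detected):
    """The witness is confirmed if an error is found in a violation node."""
    return error_detected and bool(nodes & violation_nodes)


# Tiny witness: q0 --line 3--> q1 --line 5--> qE (violation node).
edges = {
    "q0": [(lambda e: e["line"] == 3, "q1")],
    "q1": [(lambda e: e["line"] == 5, "qE")],
}
nodes = {"q0"}
for line in (3, 4, 5):
    nodes = advance(nodes, edges, {"line": line})
print(sorted(nodes), confirmed(nodes, {"qE"}, error_detected=True))
```

If edges matching an executed instruction had to be taken, as the upcoming format description may require, advance would instead drop a node whenever one of its outgoing edges matches.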

Before SYMBIOTIC-WITCH starts the symbolic execution, we remove from 
the witness automaton all nodes that are not on any path from the entry node 
to a violation node. In general, witness automata are related to program 
executions using node and edge attributes. SYMBIOTIC-WITCH currently supports 
only some attributes of witness edges to map a program execution to a given 
witness automaton. Namely, it uses the line number of executed instructions, the 
information whether the true or false branch is taken, and the information about 
entering a function or returning from a function. Additionally, if the witness 
automaton contains a single path from the entry node to a violation node and there 
is some information about return values of the __VERIFIER_nondet_* functions 
on this path, then we use these values in the symbolic execution of the program. 
Return values not provided in the witness are treated as symbolic values. 

A more precise description of the approach can be found in the bachelor’s 
thesis of P. Ayaziová [1]. 


2 Software Architecture 


The approach has been implemented in a tool called SYMBIOTIC-WITCH, which 
is basically a modification of the symbolic executor KLEE [8]. More precisely, 
it is derived from the clone of KLEE used in SYMBIOTIC, which employs the 
SMT solver Z3 [13] and supports symbolic pointers, memory blocks of symbolic 
sizes etc. For parsing of witnesses in the GraphML format, we use the library 
RAPIDXML. 

As KLEE executes programs in LLVM [12], a given C program has to be 
translated to LLVM first. We use CLANG for this translation as explained in 
Section 4. 

The current version of SYMBIOTIC-WITCH runs on LLVM version 10.0.0. 


3 Strengths and Weaknesses 


Existing violation witness checkers (excluding DARTAGNAN [10] designed for con- 
current programs) can be roughly divided into two categories. 


- CPA-WITNESS2TEST [6], FSHELL-WITNESS2TEST [6], and NITWIT [14] perform 
  one program execution based on the information in the witness. If this 
  execution violates the specification, the witness is confirmed. This approach 
  is very efficient for witnesses fully describing one program execution that 
  violates the property. However, if a witness describes more program executions 
  and only some of them violate the property, these tools can easily miss the 
  violating executions. In particular, if a witness does not specify some return 
  value of a __VERIFIER_nondet_* function, FSHELL-WITNESS2TEST uses the 
  default value 0, NITWIT picks a random value, and CPA-WITNESS2TEST 
  fails the witness confirmation. 
- CPACHECKER [5], ULTIMATEAUTOMIZER [5], and METAVAL [7] create a 
  product of a given witness automaton and the original program and analyze 
  it. As a result, some execution paths of the original program can be 
  analyzed repeatedly for different paths in the witness automaton. To suppress 
  this effect, these checkers usually ignore the possibility to stay in a witness 
  automaton node whenever there is a matching transition leaving the node. 
  Unfortunately, a valid witness can remain unconfirmed due to this strategy. 


We believe that our approach to checking violation witnesses avoids all of the 
disadvantages mentioned above. Symbolic execution allows us to efficiently examine 
many program executions corresponding to a given witness automaton, and 
program executions are not analyzed repeatedly. The approach can easily handle 
witnesses based on return values from the __VERIFIER_nondet_* functions as 
well as those based on a description of branching. 

There is only one principal case in which a valid witness is not confirmed by 
SYMBIOTIC-WITCH (ignoring the cases when SYMBIOTIC-WITCH simply runs 
out of resources). This case can arise when SYMBIOTIC-WITCH uses the 
information about return values of __VERIFIER_nondet_* functions stored in the 
witness. SYMBIOTIC-WITCH uses the information immediately when the symbolic 
execution calls such a function and there is a matching edge in the witness 
with a return value that has not been used yet (i.e., the starting node of the 
edge is in the set of tracked witness nodes and the target node is not). This 
“eager approach” usually works very well, especially for witnesses containing 
return values for all calls of __VERIFIER_nondet_* functions. However, there can be 
witnesses where some return values are missing and a particular contained return 
value should not be used for the first matching call of the __VERIFIER_nondet_* 
function. Such witnesses can be valid, but SYMBIOTIC-WITCH can fail to confirm 
them. As far as we know, such witnesses do not appear in SV-COMP, and other 
witness checkers would probably fail to confirm them as well. 
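
The eager use of return values can be sketched as follows. This is only an illustration under stated assumptions: the edge representation, the matching by line number alone, and the decision to add the target node to the tracked set after a value is used are all hypothetical simplifications, not the tool's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WitnessEdge:
    """Hypothetical witness edge carrying a nondet return value."""
    source: str
    target: str
    line: int
    return_value: int


def eager_return_value(tracked, edges, call_line):
    """Return a witness value for a nondet call, or None for a symbolic value.

    An edge matches if its line equals the call line, its source node is
    tracked, and its target node is not tracked yet (value not used yet).
    """
    for edge in edges:
        if (edge.line == call_line
                and edge.source in tracked
                and edge.target not in tracked):
            tracked.add(edge.target)  # assumption: using a value advances tracking
            return edge.return_value
    return None  # no usable value: treat the result as symbolic


tracked_nodes = {"n0"}
witness_edges = [WitnessEdge("n0", "n1", 5, 7), WitnessEdge("n1", "n2", 5, 3)]
first = eager_return_value(tracked_nodes, witness_edges, 5)
second = eager_return_value(tracked_nodes, witness_edges, 5)
third = eager_return_value(tracked_nodes, witness_edges, 5)
```

The sketch also shows the limitation discussed above: the first matching call always consumes the first unused value, even if a valid witness intended that value for a later call.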

On the negative side, our approach inherits the disadvantages and limitations 
of symbolic execution and KLEE. In particular, it can suffer from the path explosion 
problem on witnesses that do not provide return values of __VERIFIER_nondet_* 
functions. Further, SYMBIOTIC-WITCH does not support parallel programs, as 
KLEE does not support them. 

Our current approach is suitable for cases when a witness can be checked 
based on a finite program execution. That is why our tool supports violation 
witnesses of safety properties. Table 1 shows the numbers of violation witnesses 
confirmed in SV-COMP 2022 [2] by individual witness checkers in the categories 
supported by SYMBIOTIC-WITCH. 
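
To put the numbers of Table 1 into perspective, the following sketch computes the share of witnesses confirmed by SYMBIOTIC-WITCH in each category. The counts are copied from Table 1; the helper name is hypothetical.

```python
# Counts taken from Table 1 (SV-COMP 2022).
witnesses = {
    "ReachSafety": 26797,
    "MemSafety": 16984,
    "NoOverflows": 2808,
    "SoftwareSystems": 2102,
}
confirmed_by_witch = {
    "ReachSafety": 11176,
    "MemSafety": 8394,
    "NoOverflows": 2609,
    "SoftwareSystems": 179,
}


def confirmation_rate(confirmed, total):
    """Percentage of witnesses in a category that were confirmed."""
    return 100.0 * confirmed / total


rates = {
    category: round(confirmation_rate(confirmed_by_witch[category], total), 1)
    for category, total in witnesses.items()
}
for category, rate in rates.items():
    print(f"{category}: {rate}%")
```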

We believe that symbolic execution can also be used for checking termination 
violation witnesses and for checking correctness witnesses. We plan to extend 
SYMBIOTIC-WITCH in these directions. We also plan to add a witness refinement 
mode [5] already provided by CPACHECKER and ULTIMATEAUTOMIZER. In this 
mode, when a witness is confirmed, SYMBIOTIC-WITCH would produce another 
witness describing a single program execution (by specifying return values for all 
calls of __VERIFIER_nondet_* functions) that exhibits the property violation. 

Table 1. The numbers of confirmed witnesses in relevant SV-COMP 2022 categories 

                       ReachSafety  MemSafety  NoOverflows  SoftwareSystems 

number of witnesses         26 797     16 984        2 808            2 102 
CPACHECKER                  14 908     12 594        2 334              621 
CPA-WITNESS2TEST             8 628        231          887                6 
FSHELL-WITNESS2TEST         14 168        954        1 436               33 
METAVAL                          0        116        1 982                0 
NITWIT                      15 507          -            -                0 
SYMBIOTIC-WITCH             11 176      8 394        2 609              179 
ULTIMATEAUTOMIZER            8 592      4 197        2 468               26 


4 Tool Setup and Configuration 


For the use in SV-COMP 2022, we have integrated our witness checker (origi- 
nally called WITCH-KLEE) with SYMBIOTIC [9], which takes care of translation 
of a given C program into LLVM using CLANG and then slightly modifies the 
LLVM program to improve the efficiency of witness checking. 

The archive with SYMBIOTIC-WITCH can be downloaded from SV-COMP 
archives. The witness checking process is invoked by 


./symbiotic [-prp <prop>] [-32] -witness-check <wit.graphml> <prog.c> 


where <wit.graphml> is a violation witness to be checked and <prog.c> is the 
corresponding program. By default, the tool considers the reachability safety 
property and a 64-bit architecture. The considered property can be changed by the 
-prp option with <prop> instantiated to memsafety, memcleanup, or no-overflow. 
The 32-bit architecture is set by -32. 
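
For scripted use, the invocation above can be assembled programmatically. The sketch below only builds the argument list, using the option spellings exactly as printed in this text; the function name is hypothetical, and actually running the command (e.g., via subprocess.run) assumes the unpacked archive is the current directory.

```python
# Property names listed in the text for the -prp option.
SUPPORTED_PROPERTIES = ("memsafety", "memcleanup", "no-overflow")


def symbiotic_command(witness, program, prop=None, use_32bit=False):
    """Build the argument list for a witness-checking run.

    prop=None keeps the default (reachability safety); use_32bit=False keeps
    the default 64-bit architecture.
    """
    if prop is not None and prop not in SUPPORTED_PROPERTIES:
        raise ValueError(f"unsupported property: {prop}")
    cmd = ["./symbiotic"]
    if prop is not None:
        cmd += ["-prp", prop]
    if use_32bit:
        cmd.append("-32")
    cmd += ["-witness-check", witness, program]
    return cmd


default_cmd = symbiotic_command("wit.graphml", "prog.c")
memsafety_cmd = symbiotic_command("wit.graphml", "prog.c",
                                  prop="memsafety", use_32bit=True)
```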

Our witness checker can also be downloaded directly from its repository 
mentioned below. The version used in SV-COMP 2022 is marked with the tag 
SV-COMP22. It can be executed without SYMBIOTIC via a shell script as 


./witch.sh <prog.c> <wit.graphml> 


which calls CLANG to translate <prog.c> to LLVM and then passes the LLVM 
program and the witness <wit.graphml> to the witness checker. 


5 Software Project and Contributors 


SYMBIOTIC-WITCH has been developed at the Faculty of Informatics, Masaryk 
University by Paulína Ayaziová under the guidance of Marek Chalupa and Jan 
Strejček. The tool is available under the MIT license and all used tools and 
libraries (LLVM, KLEE, Z3, RAPIDXML, SYMBIOTIC) are also available under 
open-source licenses that comply with SV-COMP’s policy for the reproduction 
of results. The source code of our witness checker can be found at: 


https://github.com/ayazip/witch-klee 




Data Availability Statement. All data of SV-COMP 2022 are archived as described 
in the competition report [2] and available on the competition web site. This includes 
the verification tasks, results, witnesses, scripts, and instructions for reproduction. The 
version of SYMBIOTIC-WITCH used in the competition is archived together with other 
participating tools [3]. 

Abstract. THETA is a model checking framework based on abstraction 
refinement algorithms. In SV-COMP 2022, we introduce: 1) reasoning at 
the source-level via a direct translation from C programs; 2) support for 
concurrent programs with interleaving semantics; 3) mitigation for non- 
progressing refinement loops; 4) support for SMT-LIB-compliant solvers. 
We combine all of the aforementioned techniques into a portfolio with 
dynamic algorithm selection. 


1 Verification Approach and Software Architecture 


THETA [10] is a generic and configurable model checking framework written in 
Java 11. A simplified version of the architecture (focusing on software verification 
aspects) can be seen in Figure 1. 


[Figure 1: the C code parser (ANTLR) produces an XCFA, which passes through 
simplification passes and the CEGAR analysis (using the SMT interface to 
MATHSAT, CVC4, and Z3); result processing then produces the witness, using 
metadata from the parser.] 

Fig. 1. Architecture of THETA. 


The input is a C program that is first translated to extended control-flow 
automata (XCFA). Previously, THETA used LLVM [3], which had various 
advantages, but its static single assignment (SSA) form proved overall disadvantageous 
for abstraction-based algorithms. This year we use a new, direct translation (no 
intermediate language and no SSA form) via an ANTLR parser. Furthermore, the 
CFA is “extended” in the sense that, since this year, we support concurrent 
programs by an analysis with interleaving semantics. After parsing, we 
apply various passes to the XCFA (e.g., large-block encoding or partial order 
reduction). The core of THETA is a CEGAR-based analysis framework, targeting 
reachability properties via predicate and explicit analyses [8], along with 
interpolation and Newton-based refinements [7]. This year, THETA added generic 
support for SMT solvers (including interpolation) via the SMT-LIB interface. 
At SV-COMP’22 we use CVC4 [4], MATHSAT [6], and Z3 [9], where the latter 
is used via the Java API as before. Finally, a verdict (safe, unsafe, unknown) 
and a witness corresponding to the C program are produced (using metadata from 
the translation). 


[Figure 2: decision diagram of the portfolio. Recoverable labels include the 
decision point "Has loops and cycl. compl. < 30" with yes/no branches, 
configurations written as solver/domain/refinement (e.g., Z/E/N and Z/PC/B), 
internal timeouts of 30s and 300s, and edges marked "?" or "solver issue".] 

Fig. 2. Overview of the dynamic portfolio of THETA. 

Verification portfolio. Based on preliminary experiments and domain 
knowledge, we manually constructed a dynamic algorithm selection portfolio [1] for 
SV-COMP’22, illustrated by Figure 2. Rounded white boxes correspond to 
decision points. We start by branching on the arithmetic (floats, bitvectors, integers). 
Under integers, there are further decision points based on the cyclomatic 
complexity and the number of havocs and variables. Grey boxes represent 
configurations, defining the solver/domain/refinement in this order. Lighter and darker 
grey represent explicit and predicate domains, respectively. Internal timeouts 
are written below the boxes. An unspecified timeout means that the 
configuration can use all the remaining time. The solver can be CVC4 (C) [4], MATHSAT 
(M), MATHSAT with floats (Mf) [6] or Z3 (Z) [9]. Abstract domains are explicit 
values (E), explicit values with all variables tracked (EA), Cartesian predicate 
abstraction (PC) or Boolean predicate abstraction (PB) [8]. Finally, refinement 
can be Newton with weakest preconditions (N) [7], sequence interpolation (S) or 
backward binary interpolation (B) [8]. Arrows marked with a question mark (?) 
indicate an inconclusive result, which can happen due to timeouts or unknown 
results. Furthermore, this year’s portfolio also includes a novel dynamic (run-time) 
check for refinement progress between iterations that can shut down potential 
infinite loops (by treating them as an unknown result) [1]. Note also that for solver 
issues (e.g., exceptions from the solver) we have different paths in some cases. 
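
The sequential part of such a portfolio can be pictured with the following minimal sketch. It is not THETA's implementation: the configuration names are used only as labels, the analysis callables are hypothetical stand-ins, and solver issues are simply treated as inconclusive results rather than routed along separate paths.

```python
import time


def run_portfolio(configurations, task, total_budget):
    """Try configurations in order until one gives a conclusive verdict.

    Each configuration is (name, analysis, timeout); timeout=None means the
    configuration may use all the remaining time.
    """
    deadline = time.monotonic() + total_budget
    for name, analysis, timeout in configurations:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        budget = remaining if timeout is None else min(timeout, remaining)
        try:
            verdict = analysis(task, budget)
        except RuntimeError:
            verdict = "unknown"  # e.g., a solver issue: move on to the next one
        if verdict in ("safe", "unsafe"):
            return name, verdict
    return None, "unknown"


def inconclusive(task, budget):
    return "unknown"  # stands for a timeout or a non-progressing refinement


def solver_issue(task, budget):
    raise RuntimeError("solver issue")


def conclusive(task, budget):
    return "safe"


portfolio = [
    ("M/E/S", inconclusive, 300),
    ("Z/PC/B", solver_issue, 30),
    ("Z/E/N", conclusive, None),
]
winner = run_portfolio(portfolio, "task.c", total_budget=900)
```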


2 Strengths and Weaknesses 


THETA currently targets ReachSafety and ConcurrencySafety with limited sup- 
port for structs, arrays and pointers, and no support for dynamic memory al- 
location, mutexes and recursion. Due to this, THETA fails for most tasks in 
ProductLines, Recursive, Heap and Arrays. Out of the 6163 tasks, roughly 2/3 
can be translated and there are 888 confirmed correct (541 safe, 347 unsafe), 116 
unconfirmed correct, and only 15 incorrect (11 false positive, 4 false negative) 
results [5]. Note that almost all unsupported cases are detected and reported as 
an error, and we only have a few incorrect results due to subtle issues. 

The main strength of the tool is the combination of algorithm selection (pick 
algorithm based on input) and portfolios (try multiple algorithms until one suc- 
ceeds). Out of the 1004 correct results, 315 could not be solved by the first 
configuration that the portfolio tries: dynamic checks intervened for 181 internal 
timeouts, 72 solver issues (e.g. wrong models), 19 non-progressing refinements, 
and 74 other (unknown) faults before the eventual success. 

Having a diverse portfolio also paid off. Bitvector and float arithmetic tasks 
were either solved by explicit analyses (with a mixture of interpolation- and 
Newton-based refinements) before even trying predicate configurations, or if ex- 
plicit analyses failed, predicate configurations were unsuccessful too. The inte- 
ger arithmetic required a more diverse configuration set: Predicate abstraction 
solved roughly 48% of the tasks (45% Cartesian, 3% Boolean) and explicit anal- 
ysis solved 52% (33% with empty precision, 19% with all variables tracked). 

The SMT-LIB support provided a great improvement: previously we only 
had Z3, which still dominates the integer cases. However, all of the bitvector 
tasks were solved by MATHSAT, making Z3 an unused backup. With floats, 
roughly half of the tasks were solved by MATHSAT, while the other half needed 
CVC4 as backup. Since floats are reduced to bitvectors, we did not rely on Z3 
based on poor performance in our preliminary experiments. 

The most successful subcategories are BitVectors, ControlFlow, Loops, and XCSP 
(38-45% correct), mostly because they use features of C that our frontend sup- 
ports well. We plan to mitigate the high number of timeouts in the future with 
approximations (e.g. mixing integers and bitvectors), and further analyses (e.g., 
inferring loop invariants). We also have a significant amount of unconfirmed 
results: we believe this can be improved by generating more compact witnesses. 

This year THETA added support for sequential concurrency via a preprocessing 
step: it yields an encoding where exploring all interleavings preserves 
inter-thread behaviors. The analyses treat consecutive non-global memory accesses 
as one atomic block, reducing the exploration of unnecessary total orders. A 
drawback of using preprocessing for partial order reduction instead of an on-line 
algorithm is the superfluous exploration of certain total orders, e.g., all 
interleavings of independent global memory accesses will also be explored. This is 
because such accesses might overlap with non-independent memory accesses at 
other times, and the preprocessing step is not aware of such details. 

Using a wrapper, THETA integrates concurrency seamlessly with the exist- 
ing framework (abstract domains, refinements), except the error location-based 
search [8] (used for non-concurrent cases) because the required distance metric is 
not well defined in concurrent programs. Instead, we opted to use a breadth-first 
search, which had outperformed depth-first strategies in preliminary tests. We 
theorize that this is due to bugs being reachable within the first few instructions 
most of the time, but only via a specific total order. The performance for con- 
current programs is still limited though, and we plan to integrate a declarative 
approach in the future, which could be used for weakly-ordered programs as well. 


3 Tool Setup and Configuration 


The competition contribution is based on THETA 3.0.0-svcomp22-v1. 
Additionally, THETA uses CVC4 v1.9, MATHSAT v5.6.6 and Z3 v4.5.0. The project’s 
repository contains build instructions, and an archive with pre-built binaries 
for Ubuntu 20.04 (LTS) can be found at the SV-COMP repository and on Zenodo [2]. 
The toolchain requires the packages openjdk-11-jre-headless, libgomp1 
and libmpfr-dev to be installed. The entry point of the toolchain is the script 
theta/theta-start.sh, which takes the verification task (C program) as its 
only mandatory input and runs the portfolio. As additional arguments we use 
--portfolio COMPLEX --witness-only --loglevel RESULT. Further arguments 
are described in the readme included with the binaries. 


4 Software Project 


THETA is maintained by the Critical Systems Research Group of the Budapest 
University of Technology and Economics with various contributors. The project 
is available open-source on GitHub under an Apache 2.0 license. 


Data Availability. The version of THETA used in this paper is available at [2]. 

Abstract. ULTIMATE GEMCUTTER verifies concurrent programs using 
the CEGAR paradigm, by generalizing from spurious counterexample 
traces to larger sets of correct traces. We integrate classical CEGAR gen- 
eralization with orthogonal generalization across interleavings. Thereby, 
we are able to prove correctness of programs otherwise out-of-reach for 
interpolation-based verification. The competition results show significant 
advantages over other concurrency approaches in the ULTIMATE family. 


1 Verification Approach 


ULTIMATE GEMCUTTER is a verification tool for concurrent programs based on 
the CEGAR paradigm: It (1) picks a trace from the set of all program 
interleavings (a possible “counterexample”), (2) proves correctness of this trace (the 
counterexample is “spurious”), and (3) generalizes the proof to conclude that a 
larger (usually infinite) set of traces is correct. Classically, CEGAR focuses on 
generalization across traces with varying numbers of loop iterations, by finding 
inductive loop invariants. GEMCUTTER proposes additional generalization along 
an orthogonal axis: across interleavings. 

[Figure: the trace τ, its equivalence class [τ] (blue area), the language L 
(green area), and cl(L) (pink area), drawn along the iteration and interleaving 
axes.] 

Concurrent programs contain many redundant interleavings of actions from 
different threads, i.e., interleavings with the same (input/output-) behaviour. A 
naive application of CEGAR requires explicit proofs of correctness for all these 
interleavings. Intermediate states during execution of redundant interleavings 
differ, and different interleavings often require different correctness proofs. 
GEMCUTTER addresses this as illustrated in the figure above: We prove 
correctness of a trace τ, here τ = a_1 a_2 b, where a_1, a_2 are actions of the first thread, 
and b is an action of the second thread. The proof of correctness is generated 
using Craig interpolation or similar techniques. We generalize this proof into a 
Floyd-Hoare automaton [8] to show that a regular language L (green area in 
the figure above) of traces is correct. The new contribution is the subsequent 
generalization step: If a trace τ_1 differs from a (correct) trace τ_2 in L only by 
the ordering of independent statements, these traces are (Mazurkiewicz-) 
equivalent [3]. We conclude that τ_1 is also correct. Hence, the set of all such traces, 
denoted cl(L) (pink area), contains only correct traces. If the set of all program 
interleavings P is a subset of cl(L), we conclude that the program is correct. 


To soundly make this conclusion, we need a suitable notion of independence 
between statements, which guarantees that the order of execution of two indepen- 
dent statements does not matter for program correctness. An intuitive sufficient 
condition is that neither statement writes to a memory location read or written 
by the other statement. If we cannot establish this condition syntactically, we use 
an SMT solver to check if executing the statements in either order is guaranteed 
to give the same result. We use information from the Floyd-Hoare automaton to 
refine this check in the style of conditional independence [5]. Such information 
can for instance express (but is not limited to) non-aliasing of pointers. 
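
The intuitive sufficient condition for independence stated above can be written down directly. The sketch assumes that each statement comes with known sets of read and written memory locations; the class and function names are hypothetical, and the SMT-based and conditional checks are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """A statement with the memory locations it reads and writes."""
    text: str
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def syntactically_independent(s1, s2):
    """True if neither statement writes a location read or written by the other."""
    conflict_1 = s1.writes & (s2.reads | s2.writes)
    conflict_2 = s2.writes & (s1.reads | s1.writes)
    return not conflict_1 and not conflict_2


inc_x = Statement("x = x + 1", reads=frozenset({"x"}), writes=frozenset({"x"}))
inc_y = Statement("y = y + 1", reads=frozenset({"y"}), writes=frozenset({"y"}))
copy_x = Statement("z = x", reads=frozenset({"x"}), writes=frozenset({"z"}))
```

Two statements that only read a shared location remain independent under this condition, since neither of them writes to it.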


However, the inclusion P ⊆ cl(L) is in general undecidable [3]; cl(L) may 
not be regular. We reverse our viewpoint to provide a sufficient condition that 
can be effectively checked: Rather than adding all equivalent traces to L (thus 
obtaining cl(L)), we instead remove all but one trace of each equivalence class 
from P, yielding a reduction P' of P (formally, cl(P') = P). We use the 
sleep set technique [5] to remove transitions from an automaton for P to get an 
automaton that recognizes one such reduction P'. We then check whether the 
(regular) reduction P' is included in the (regular) language L. If this inclusion 
P' ⊆ L holds, it implies that P ⊆ cl(L) also holds, and the program is correct. 
If the inclusion does not (yet) hold, GEMCUTTER picks another program trace 
and repeats the process, iteratively building up the language L of correct traces 
by taking the union of the Floyd-Hoare automata computed in all iterations. 
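
The reduction-based check can be illustrated on finite sets of traces. The sketch below is a brute-force substitute for the automata-based sleep set technique: it enumerates equivalence classes by swapping adjacent independent actions and keeps the lexicographically least trace of each class. All names are hypothetical, and P is assumed to be closed under equivalence (as the set of all interleavings is).

```python
from collections import deque


def equivalence_class(trace, independent):
    """All traces obtained from `trace` by swapping adjacent independent actions."""
    start = tuple(trace)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for i in range(len(current) - 1):
            a, b = current[i], current[i + 1]
            if independent(a, b):
                swapped = current[:i] + (b, a) + current[i + 2:]
                if swapped not in seen:
                    seen.add(swapped)
                    queue.append(swapped)
    return seen


def reduction(program_traces, independent):
    """One representative (the lexicographically least) per equivalence class."""
    return {min(equivalence_class(t, independent)) for t in program_traces}


def proof_covers_program(program_traces, proven_traces, independent):
    """Sufficient check P' ⊆ L, which implies P ⊆ cl(L)."""
    return reduction(program_traces, independent) <= {tuple(t) for t in proven_traces}


def different_threads(x, y):
    # a1 and a2 belong to the first thread, b to the second; actions of
    # different threads are assumed independent in this toy example.
    return (x == "b") != (y == "b")


P = {("a1", "a2", "b"), ("a1", "b", "a2"), ("b", "a1", "a2")}
P_reduced = reduction(P, different_threads)
covered = proof_covers_program(P, {("a1", "a2", "b")}, different_threads)
missed = proof_covers_program(P, {("b", "a1", "a2")}, different_threads)
```

In the last call the check fails although P ⊆ cl(L) holds, because L contains a different representative of the same class than P'; the condition is sufficient but not necessary.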


A key feature of the reduction-based approach is that the generalization 
along the iteration and interleaving axes is combined not just additively, but 
multiplicatively: In the geometrical intuition of the figure above, we do not just 
take the union of L (green area) with the equivalence class [τ] of τ (blue area), but 
consider all traces in cl(L) (the pink area which is spanned by both). Further, 
we heuristically try to pick a set of representatives in a way that harmonizes 
with CEGAR generalization, i.e., a reduction P' with simple loop invariants. To 
this end, we prefer representatives with context-switches at all loop 
boundaries. Ideally, each thread performs one complete loop iteration and then 
hands control over to the next thread (the last thread hands back control to 
the first thread). Consider the example program in the listing below, with the 
postcondition x = y. 

[Listing: example program with two threads. Thread 1 begins with "int x = 0;" 
followed by a loop; the remainder of the listing is not recoverable.] 

Here, a proof for the set of all interleavings P, or some inopportunely chosen 
reduction, needs invariants that capture the fact that x = Σ_{k=0}^{i-1} A[k], 
and similar for y. Such invariants are usually not found by Craig interpolation. 
However, the loop invariant i = j ∧ x = y suffices for the reduction that places 
context-switches 
at all loop boundaries. The general idea is that for this kind of reduction, the 
proof often needs to summarize only the effect of a single loop iteration rather 
than unboundedly many iterations (which may require quantifiers or non-linear 
arithmetic). Similar observations were first made by Farzan and Vandikas [4]. 
GEMCUTTER furthermore aims to improve efficiency of the proof check, i.e., 
the check whether a reduction P' is a subset of the set of proven traces L. The 
state explosion problem of concurrent programs makes the computation of an 
automaton recognizing a reduction P' as well as the subsequent inclusion check 
prohibitively expensive. To address this, we implemented a form of persistent set 
reduction [5], which allows us to compute a more compact automaton recognizing 
P'. This results in a more time- and memory-efficient inclusion check. 
Reductions that interact harmoniously with CEGAR generalization do not 
always allow for an efficient proof check, nor vice versa. In the ConcurrencySafety 
category, where correctness proofs may become complicated, we prioritize gen- 
eralization by computing reductions that typically allow for simpler proofs (de- 
scribed above), even though proof checking for such reductions is often more 
expensive. By contrast, in the NoDataRace category we found proof assertions 
to be usually quite simple (often only expressing non-aliasing of pointers), so we 
prioritize faster proof checks (and postpone context-switches as far as possible). 


Implementation. GEMCUTTER uses the libraries and the front-end of the ULTIMATE 
framework, and extends ULTIMATE with a new CEGAR loop implemen- 
tation and new algorithms operating on finite automata. We represent programs 
P, reductions P' and sets of proven traces L as finite automata. ULTIMATE 
constructs Floyd-Hoare automata (for L) only on-demand [7]. Due to the state 
explosion problem, GEMCUTTER extends this approach to the program and the 
reduction. The necessary parts of the automata are constructed just-in-time 
during traversal by automata algorithms. Various techniques are implemented 
as instances of a few generic interfaces (on-demand automata, and visitors that 
monitor and guide automaton traversal) for flexibility: Radically different algo- 
rithms can be created by configuring, exchanging and stacking interface imple- 
mentations. The following techniques and optimizations (all used in SV-COMP) 
can be combined with each other independently: (i) sleep set reduction; (ii) per- 
sistent set reduction; (iii) discovery and pruning of states that cannot reach ac- 
cepting states; (iv) guidance towards representatives of a specific form, e.g. with 
context-switches at loop boundaries; and (v) inclusion check between automata. 


2 Strengths and Weaknesses 


The main advantage over other concurrency approaches in ULTIMATE (in 
AUTOMIZER and TAIPAN) lies in the generalization across interleavings: AUTOMIZER 
and TAIPAN typically require more complex proofs, possibly out of reach for Craig 
interpolation and similar techniques. GEMCUTTER performs significantly better, 
winning 3rd place in the ConcurrencySafety category (behind the bounded model 
checkers DEAGLE [6] and CSEQ [10]) and 1st place in the NoDataRace demo 
category. For details, refer to the competition report [1]. 

Since our proof check decides a stronger condition (P' ⊆ L), it might miss 
some cases in which the proof is actually sufficient, i.e., P ⊆ cl(L) holds. This 
is because P' and L might contain different representatives for the same 
equivalence class of interleavings. This weakness cannot be resolved completely due 
to the undecidability of the inclusion P ⊆ cl(L). It can however be attenuated 
by considering other choices of representatives (other than preferring 
context-switches at loop boundaries) and exploring the effect. This choice is currently 
given as an input parameter; an approach that heuristically chooses a reduction 
based on the program structure might perform better. Our notion of indepen- 
dence between statements is currently ignorant of the specification being verified. 
We hope to extend our approach to take this into account. Finally, our approach 
(and implementation) can be easily extended with other reduction methods that 
correspond to more aggressive generalization along the interleaving axis. 

Our approach only verifies programs with a bounded number of threads. 
GEMCUTTER runs out of time or memory if it is unable to establish such an up- 
per bound, e.g. for many benchmarks in pthread-ext/ or goblint-regression/. 


3 Architecture, Setup, Configuration, and Project 


GEMCUTTER is part of the program analysis framework ULTIMATE, written in 
Java and licensed under LGPLv3. GEMCUTTER version 0.2.2-839c364b requires 
Java 11 and Python 3.6. Its Linux version, binaries of the required SMT solvers, 
and a Python wrapper script were submitted as a .zip archive. GEMCUTTER 
is invoked with 

./Ultimate.py --spec <p> --file <f> --architecture <a> --full-output 

where <p> is an SV-COMP property file, <f> is an input C file, <a> is the 
architecture (32bit or 64bit), and --full-output enables verbose output to stdout. 
A violation witness may be written to the file witness.graphml. The 
benchmarking tool BENCHEXEC [2] supports GEMCUTTER through the tool-info module 
ultimategemcutter.py. GEMCUTTER participates in the ConcurrencySafety and 
NoDataRace categories, as declared in its SV-COMP benchmark definition file 
ugemcutter.xml. 


Data Availability. Our .zip archive is available online (see the links below) 
and on Zenodo [9]. 


Links: 
ULTIMATE: ultimate.informatik.uni-freiburg.de and github.com/ultimate-pa/ultimate 
LGPLv3: www.gnu.org/licenses/lgpl-3.0.en.html 
SMT solvers: Z3 (github.com/Z3Prover/z3), CVC4 (cvc4.github.io) and MATHSAT (mathsat.fbk.eu) 
Tool-info module: github.com/sosy-lab/benchexec/blob/main/benchexec/tools/ultimategemcutter.py 
Benchmark definition: gitlab.com/sosy-lab/sv-comp/bench-defs/-/blob/main/benchmark-defs/ugemcutter.xml 
Archive: gitlab.com/sosy-lab/sv-comp/archives-2022/-/blob/main/2022/ugemcutter.zip and git.io/JM69B 

Abstract. We describe and evaluate a violation-witness validator for 
Java verifiers called Wit4Java. It takes a Java program with a safety 
property and the respective violation-witness output by a Java verifier 
to generate a new Java program whose execution deterministically vio- 
lates the property. We extract the value of the program variables from 
the counterexample represented by the violation-witness and feed this 
information back into the original program. In addition, we have two im- 
plementations for instantiating source programs by injecting counterex- 
amples. Experimental results show that Wit4Java can correctly validate 
the violation-witnesses produced by JBMC and GDart in a few seconds. 


Keywords: Witness Validation · Software Verification · Java Bytecode. 


1 Overview 


Witness validation is the process of checking whether the same results can be 
reproduced independently according to the given program, specification, verifi- 
cation result, and the generated witness, improving the trust level of the software 
verifiers [2]. 

Here, we describe and evaluate a new violation-witness validator for Java 
programs called Wit4Java. We take an approach similar to Rocha et al. [5] and 
Beyer et al. [1] for C programs and apply it to Java programs. As a result, we 
implement Wit4Java as a Python script that creates a new Java program or a 
unit test case using Mockito with the program variable values extracted from the 
counterexample. As input, Wit4Java uses the violation-witness in the GraphML 
format to extract the value of the non-deterministic variables in Java programs. 
Lastly, Wit4Java runs the newly created program using the Java Virtual Machine 
(JVM) to check the assert statements. 

There are some validators for C programs in the literature [6,12]. For exam- 
ple, NitWit is an interpretation-based witness validator that can execute each 
statement step-by-step without compiling the entire program [12]. The concept 
of MetaVal is to generate a new program based on the input and then use any 
checker to check for specifications [6]. CPA-witness2test and FShell-witness2test 
are execution-based validators for C programs that can process the witness in 
GraphML format and generate a test harness that drives the program to the spec- 
ification violation [1]. Rocha et al. focus on the counterexample produced by ES- 
BMC [4] while CPA-witness2test and FShell-witness2test can process GraphML 
files. However, witness validation for SV-COMP’s Java track [7] is still at an early 
stage. GWIT is another validator that uses assumptions to prune the search 
space for dynamic symbolic execution, limiting the analysis to paths where a 
given assumption holds [10,11]. 


[Figure 1: wit4java.py takes the Java program (.java) and the violation witness 
(.graphml) as inputs, performs counterexample extraction and assignment, and 
produces either a new program (Version 1.0) or a unit test case by Mockito 
(Version 2.0), which yields the witness result.] 

Fig. 1. Wit4Java Architecture. The grey boxes represent the inputs and outputs, and 
the white boxes represent the validation process. 


2 Validation Approach 


The architecture of Wit4Java is illustrated in Fig. 1. First, Wit4Java takes the 
Java program and the witness as input. Then, it uses the Python package Net- 
workX to read the graph content of the witness and extracts the counterexample 
values of the variables corresponding to the source program from the violation- 
witness and saves them. After that, it generates new programs that contain the 
witness’s assumptions. Finally, the validation process is performed by the JVM 
(using the -ea option) to check whether the execution of the generated program 
exhibits the detected assertion failure. 
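
The extraction step can be sketched as follows. Wit4Java itself reads the witness with NetworkX; to stay within the standard library, this illustration swaps in xml.etree.ElementTree instead. The data keys (originfile, startline, assumption) are the ones shown in Listing 1.3, while the function name, the node identifiers, and the concrete values in the example witness are made up for illustration.

```python
import xml.etree.ElementTree as ET

# XML namespace used by GraphML documents.
GRAPHML_NS = {"g": "http://graphml.graphdrawing.org/xmlns"}


def extract_assumptions(graphml_text):
    """Return (startline, assumption) pairs for all edges that carry both."""
    root = ET.fromstring(graphml_text)
    result = []
    for edge in root.iterfind(".//g:edge", GRAPHML_NS):
        data = {d.get("key"): (d.text or "").strip()
                for d in edge.iterfind("g:data", GRAPHML_NS)}
        if "assumption" in data and "startline" in data:
            result.append((int(data["startline"]), data["assumption"]))
    return result


# Hypothetical miniature witness; only the second edge has an assumption.
EXAMPLE_WITNESS = """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">
  <graph edgedefault="directed">
    <node id="n0"/><node id="n1"/><node id="n2"/>
    <edge source="n0" target="n1">
      <data key="originfile">Main.java</data>
      <data key="startline">2</data>
    </edge>
    <edge source="n1" target="n2">
      <data key="originfile">Main.java</data>
      <data key="startline">3</data>
      <data key="assumption">v1 = 1;</data>
    </edge>
  </graph>
</graphml>"""

assumptions = extract_assumptions(EXAMPLE_WITNESS)
```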

There are two implementations (Wit4Java 1.0 and Wit4Java 2.0) to extract
and use counterexamples. The first version saves them as tuples (linenum,
counterexample). It then reads the source program and replaces the variables
in the program statements with counterexamples if the line number and
variable in the program match the tuple, thus generating a new Java program.
In comparison, the second version records the data types and values of
the counterexamples and saves them sequentially into two lists. Moreover, only
the assumptions made in the witness for the non-deterministic variables (as
determined by Verifier.nondet) are recorded. Then, it builds a unit test case and
employs the Mockito framework to mock the Verifier.nondet calls in the source
program so that they return deterministic counterexample values from the lists.
This makes the execution of the source program follow the path described in the
witness and eventually reach the violated property.
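
The substitution performed by the first version can be sketched as follows. This is an illustrative approximation only: the function name substitute_counterexamples, the regular expressions, and the assumption that each counterexample has the form "variable = value;" are ours and are not taken from the Wit4Java sources.

```python
import re


def substitute_counterexamples(source_lines, counterexamples):
    """Replace the right-hand side of matching assignments with witness values.

    counterexamples is a list of (linenum, assumption) tuples with 1-based
    line numbers, e.g. (1, "v1 = 1;").
    """
    result = list(source_lines)
    for linenum, assumption in counterexamples:
        parsed = re.fullmatch(r"\s*(\w+)\s*=\s*(.+?);?\s*", assumption)
        if parsed is None or not 1 <= linenum <= len(result):
            continue
        var, value = parsed.groups()
        # Rewrite only an assignment to the same variable ("=" but not "==").
        pattern = re.compile(r"(\b" + re.escape(var) + r"\s*=(?!=)\s*)[^;]+;")
        result[linenum - 1] = pattern.sub(
            lambda m: m.group(1) + value + ";", result[linenum - 1], count=1)
    return result


source = [
    "int v1 = Verifier.nondetInt();",
    "int v2 = Verifier.nondetInt();",
    "assert v1 == v2;",
]
for line in substitute_counterexamples(source, [(1, "v1 = 1;"), (2, "v2 = 0;")]):
    print(line)
```

A line is left unchanged when its number or variable does not match any tuple, so the assertion itself is carried over verbatim into the generated program.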
Listing 1.1. Analyzed program

int v1 = Verifier.nondetInt();
int v2 = Verifier.nondetInt();
assert v1 == v2;

Listing 1.2. Output of Wit4Java 1.0

int v1 = 1;
int v2 = 0;
assert v1 == v2;


We show examples for both implementations in Listings 1.1 to 1.4. Wit4Java
1.0 (the naive version) saves the counterexamples from the witness in line-number
order. It directly replaces the variable values in the source program, thus
generating a new program (cf. Listing 1.2). Wit4Java 2.0 (the Mockito version)
generates a test case that returns the counterexample value when the mocked
function is called (cf. Listing 1.4).


Listing 1.3. Violation witness

<edge source="203.167" target="207.186">
  <data key="originfile">Main.java</data>
  <data key="startline">...</data>
  <data key="assumption">v1 = 1;</data>
</edge>
<edge source="207.186" target="252.201">
  <data key="originfile">Main.java</data>
  <data key="startline">14</data>
  <data key="assumption">v2 = 0;</data>
</edge>

Listing 1.4. Output of Wit4Java 2.0

List_type = [int, int];
List_value = [1, 0];
Mockito.mockStatic(Verifier.class);
int n = List_type.length;
OngoingStubbing<Integer> stubbing_int =
    Mockito.when(Verifier.nondetInt());
for (int i = 0; i < n; i++) {
  if ("int".equals(List_type[i]))
    stubbing_int = stubbing_int.
        thenReturn(Integer.parseInt(List_value[i]));
}
main(new String[0]);
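
The replay idea behind the Mockito version can be illustrated with a small Python analogue. It is a sketch under stated assumptions rather than the Java code that Wit4Java generates: the Verifier class, program, and replay below are hypothetical stand-ins, and unittest.mock's side_effect list plays the role of the chained thenReturn calls.

```python
from unittest import mock


class Verifier:
    """Hypothetical stand-in for the Verifier class of the benchmarks."""

    @staticmethod
    def nondet_int():
        raise RuntimeError("non-deterministic input is not available at run time")


def program():
    """Python counterpart of the analyzed program in Listing 1.1."""
    v1 = Verifier.nondet_int()
    v2 = Verifier.nondet_int()
    if v1 != v2:
        # Raised explicitly so the check also fires when Python runs with -O.
        raise AssertionError("v1 == v2 violated")


def replay(values):
    """Run program() with nondet_int() returning the given values in order."""
    with mock.patch.object(Verifier, "nondet_int", side_effect=list(values)):
        try:
            program()
        except AssertionError:
            return "false"    # the violation described by the witness is reached
    return "unknown"          # the execution did not reach a violation


print(replay([1, 0]))
print(replay([2, 2]))
```

Because the mock hands out one value per call, the source program itself stays unmodified; only the source of non-determinism is replaced.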


3 Discussion of Strengths and Weaknesses 


Fig. 2 on the left compares the validation results of the two validation tools
Wit4Java and GWIT. The former is based on version 1.0 (naive version). The
latter is based on violation witnesses produced by GDart. The results indicate
that Wit4Java successfully validated 140 out of 302 witnesses, while GWIT
correctly validated 150. Version 2.0 handles counterexamples with different
values for each iteration within a loop better than version 1.0. This is because
version 1.0 skips the counterexamples before the last iteration, whereas version
2.0 can fully use the counterexamples generated by each iteration. Fig. 2 on the
right compares the validation results of the two versions of Wit4Java, which
shows that version 2.0 (Mockito version) has a better validation ability (168 out
of 302), thereby outperforming both version 1.0 and GWIT. However, the tool
can only handle witnesses with concrete counterexamples. There are two main
reasons why Wit4Java reports the result unknown: JBMC [3,8] produces an empty
witness, or the witness does not contain a counterexample for a non-deterministic
value. In addition, validation for strings is not supported yet. Strings occur in
almost half of the witnesses, but JBMC does not yet output counterexample
values for them, so we were not able to test this case. Generally, there are not
enough witnesses of high quality for testing the witness validator yet because
JBMC sometimes correctly terminates without producing a witness in SV-COMP. The
witness support in the Java verifiers requires further development work so that
they are able to produce complete violation witnesses whenever they terminate
with verdict false.
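
The difference in loop handling can be illustrated with a small sketch. It assumes, purely for illustration, that keeping one counterexample per line number behaves like a dictionary keyed by line, while the second version's lists keep every value in witness order; the line number and values are hypothetical.

```python
# Hypothetical assumptions for one non-deterministic call at line 5 that is
# executed in three loop iterations: (linenum, value) in witness order.
assumptions = [(5, 3), (5, 7), (5, 2)]

# One value per line number: later iterations overwrite earlier ones.
by_line = {}
for linenum, value in assumptions:
    by_line[linenum] = value

# One value per call, in witness order: every iteration is preserved.
in_order = [value for _, value in assumptions]

print(by_line)   # {5: 2}
print(in_order)  # [3, 7, 2]
```

With a single value per line, a replay can only reproduce paths in which every iteration reads the same value, which is consistent with the loop-related limitation described above.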


[Figure 2: two stacked bar charts. Left: Wit4Java 1.0 and GWIT. Right:
Wit4Java 1.0 and Wit4Java 2.0. y-axis: number of witnesses (0 to 350).
Legend: unknown, true, false.]


Fig. 2. Validation results based on 302 witnesses. The x-axis represents the names
of the two tools and the y-axis represents the number of witnesses. A green “false”
indicates a confirmed correct result.


4 Tool Setup and Configuration 


The competition submission is based on Wit4Java version 1.0 (naive version).³


For the competition [9], Wit4Java is called by executing the script wit4java.py.
It reads .java source files and the corresponding witnesses in the given
benchmark directories. The answer is false if the assertion failure is found.
As an example, we can validate the witness by executing the following command:


./wit4java.py -witness <path-to-sv-witnesses>/witness.graphml <path-to-sv-benchmarks>/java/jbmc-regression/return2


where witness.graphml indicates the witness to be validated, and return2 indicates
the benchmark name. The BenchExec tool-info module is called wit4java.py and
the benchmark definition file is wit4java-validate-violation-witnesses.xml.
NetworkX must be installed separately on the SV-COMP machines. If a validation
task does not find a property violation, it returns unknown.
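
The mapping from a run of the generated program to a verdict can be sketched as follows. This is a hypothetical helper, not the logic of wit4java.py: it assumes that the standard error output of the `java -ea` run has been captured as text and relies only on the fact that an uncaught assertion failure is reported by the JVM as java.lang.AssertionError.

```python
def verdict_from_jvm_output(stderr_text):
    """Map the captured stderr of a `java -ea` run to a validation verdict."""
    if "java.lang.AssertionError" in stderr_text:
        return "false"    # the assertion failure was reproduced
    return "unknown"      # no property violation was observed


sample = ('Exception in thread "main" java.lang.AssertionError\n'
          "\tat Main.main(Main.java:5)\n")
print(verdict_from_jvm_output(sample))
print(verdict_from_jvm_output(""))
```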


5 Software Project and Contributors 


Tong Wu maintains Wit4Java. It is publicly available under a BSD-style license.
The source code is available at https://github.com/Anthonysdu/wit4java, and
instructions for running the tool are given in the README file.


3 https://github.com/Anthonysdu/wit4java